Abstract We consider the problem of learning a target probability distribution over a set of N binary
variables from the knowledge of the expectation values (with this target distribution) of M observables,
drawn uniformly at random. The space of all probability distributions compatible with these M expectation
values within some fixed accuracy, called version space, is studied. We introduce a biased measure
over the version space, which gives a boost increasing exponentially with the entropy of the distributions
and with an arbitrary inverse ‘temperature’ $\Gamma$. The choice of $\Gamma$ allows us to interpolate smoothly between
the unbiased measure over all distributions in the version space ($\Gamma = 0$) and the pointwise measure
concentrated at the maximum entropy distribution ($\Gamma \to \infty$). Using the replica method we compute the
volume of the version space and other quantities of interest, such as the distance R between the target
distribution and the center-of-mass distribution over the version space, as functions of $\alpha = (\log M)/N$
and $\Gamma$ for large N. Phase transitions at critical values of $\alpha$ are found, corresponding to qualitative improvements
in the learning of the target distribution and to the decrease of the distance R. However, for
fixed $\alpha$, the distance R does not vary with $\Gamma$, which means that the maximum entropy distribution is
not closer to the target distribution than any other distribution compatible with the observable values.
Our results are confirmed by Monte Carlo sampling of the version space for small system sizes (N ≲ 10).

Keywords Probabilistic inference · Maximum entropy principle · Replica method


1 Introduction 

Multi-component, strongly interacting systems, in physics and beyond, may show complex behaviours
eluding simple quantitative modeling. A common strategy to describe such systems is to define probability 
distributions over the space of their configurations. The task is at first sight daunting. The number of 
unknown probabilities scales as the dimension of the configuration space, and is enormous, generally 
exponentially large in the number of the degrees of freedom defining the system. The selection of one 
probability distribution among the multitude of possibilities may be done in a Bayesian way. A class of 
parametrized model distributions is considered, and a good choice of the parameters is sought, e.g. which 
maximizes the likelihood of the observed data. Statisticians are generally interested in understanding 
how learning proceeds, that is, in the speed of convergence to the target distribution (assumed to be 
in the parametrized class, and to be consistent with one value of the parameters) as more and more 
data are made available. Yet, though the number of parameters defining the class of distributions may 
be very large and the inference problem may be computationally very hard, the classes of parametrized
distributions are generally extremely small compared to the plethora of possible distributions over the 
configuration space. While some classes of distributions may appear more adequate to represent the data 
or more amenable to computations, those criteria are largely arbitrary. 

The Maximum Entropy (ME) principle is an alternative approach, rejecting the notion of arbitrariness.
The ME principle may be informally stated as follows: among all possible distributions compatible with
what is known of the data, choose the one with maximal entropy. It was proposed as an alternative foundation of
statistical mechanics [1,2]; as an illustration, the Boltzmann distribution in the canonical ensemble may
be recovered as the distribution with ME under the sole knowledge of the average value of the energy.
The ME principle is supported by information theory [3], which argues that any other distribution would
be too constrained, and, in fact, reflect additional properties of the data [4]. An alternative and somewhat
colourful formulation of this argument was given in [5], where the ME distribution was shown to emerge as
the most frequent distribution which an uninformed operator (monkeys in [5]) would find, upon repeated
and unbiased trials compatible with the observations on the data. In other words, given an unbiased
prior over the set of all possible distributions over the configuration space, the ME distribution is the
most likely one compatible with our knowledge of the data. Furthermore the ME distribution enjoys
important and valuable properties, e.g. a weak sensitivity to measurement errors [6]. To the practitioner,
the most compelling argument in support of the ME distribution may actually come from the successful
applications to experimental data, see for instance [7] for a clear presentation regarding biological data.

The purpose of the present work is to compare the performance of the ME distribution with that
of the other distributions compatible with the data. While the literature on the ME principle has
a rich history, we are not aware of works attempting to carry out this comparison in a quantitative and
rather general setting. We consider a set of N variables, taking binary values ±1 (the generalization to a
larger number of states would be straightforward, as long as it remains finite). Any distribution over the
space of configurations is entirely characterized by $2^N$ probabilities, which are non-negative real numbers
summing to unity. Each distribution may therefore be seen as a point in the $2^N$-dimensional simplex.
We pick one point in this simplex, hereafter referred to as the target distribution. Next we consider
a set of M observables, each of which may be any polynomial function of the variables, and compute
their average values over the target distribution. The distributions in the simplex compatible with those
data, i.e. such that the observables have the same average within some prescribed accuracy, are called
admissible. The set of admissible distributions, called version space, contains the target distribution, the
ME distribution, and many others, such as the center-of-mass distribution, which is the flat average over
all admissible distributions. Our objective is to compute the main geometrical features of the version
space, e.g. its volume, the distances from the target to the ME or to the center-of-mass distributions, the
average inter-distribution distance, etc.

To give a precise meaning to those quantities in a mathematically tractable framework we will assume
that observables are drawn from a simple statistical ensemble. More precisely the values taken by the
observables are assumed to be random and uncorrelated across the $2^N$ configurations. This assumption
is not meant to be realistic. In most real applications, indeed, observables reflect the low-order statistics
of the configurations, e.g. the value of the first variable, the product of the fifth and seventh variables, etc.
Such observables vary very smoothly over the configuration space, as they depend only on a small (compared
to N) number of the configuration variables, and may be adequate to provide information about smooth
distributions. Our hypothesis can be considered as worst-case-like in the following sense. The inference
of the target distribution from the average values of the observables may be recast as the problem of
reconstructing a $2^N$-dimensional non-negative vector from the knowledge of its scalar products with M
vectors (corresponding to the observables). When those vectors are randomly chosen the scalar products
are typically very small, of the order of $2^{-N/2}$, and are weakly informative about the direction of the target
vector. In this pessimistic setting we expect that the number of scalar products (observables) necessary
to reconstruct the target vector with accuracy will be of the order of the number of relevant components
of the target vector, that is, of the order of the exponential of the entropy of the target distribution.
While this statement is correct we will see that some important features of the target distribution are
correctly inferred even with a much smaller number M of observables. In particular, the probabilities of
configurations with large values of the target distribution are learned with a limited number of data, a
phenomenon connected to the onset of phase transitions in the learning process.

An important aspect of our framework is that it allows us to bias the measure over the version space, 
in order to boost the distributions with large entropies. The magnitude of this entropic bias may be
continuously tuned from zero (uniform measure over the version space) to infinity, which amounts to
selecting the ME distribution alone. We study how the distance between the target distribution and the
center-of-mass distribution varies with the bias (for a fixed number M of observables). Our main result is
that this distance does not depend on the bias, showing that the ME distribution is not better than any other
distribution picked at random in the version space. While this result is valid for any target distribution
in the case of random observables, we do not expect it to apply to the more realistic case where both the
observables and the target distribution are smooth.

All our results are derived within the replica symmetric framework when N and M go to infinity
at fixed ratio $\alpha = (\log M)/N$, and are therefore non-rigorous. We have, however, checked the local
stability of our replica-symmetric solution against replica-symmetry-breaking fluctuations, the so-called
replicon modes. Our results are therefore self-consistent and we expect them to be correct for large N, M.
In addition we have designed a Monte Carlo algorithm to sample the version space, and applied it to
small system sizes (N ≲ 10). Remarkably, simulations show only weak finite-size effects compared to our
large-N calculations; a good qualitative (and sometimes even quantitative) agreement with our analytical
predictions is found.

The paper is organized to be accessible to the reader not interested in the details of our calculation. In
sec. 2 we present the necessary definitions and notations. An overview of our results, free of technicalities,
is given in sec. 3. All technical details and calculations are reported in sec. 4 and in the Appendix. We
present the sampling algorithm and the results of our numerical simulations in sec. 5. Conclusions can
be found in the last section.


2 Definitions and notations 

2.1 Target distribution 

Let us consider a system consisting of N Ising spins, with configurations $s = \{s_i = \pm 1\}_{i=1}^N$. The probability
distribution of the system configurations, hereafter called target distribution, is denoted by $\hat p_s$.
We consider large-size systems, $N \gg 1$, and write

$$\hat p_s \doteq e^{-N\omega_s} , \tag{1}$$

where the rate $\omega_s \ge 0$ of the configuration probability $\hat p_s$ is introduced, and the symbol $\doteq$ stands for
equality in the leading exponential-in-N term. It is convenient to introduce the entropy $\sigma$ of the rates,


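$$\sigma(\omega) = \lim_{\epsilon \to 0^+} \lim_{N \to \infty} \frac{1}{N} \log \Big[ \text{nb. of configurations } s \text{ such that } e^{-N\omega} < \hat p_s < e^{-N(\omega - \epsilon)} \Big] . \tag{2}$$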


The entropy curve $\sigma(\omega)$ has some remarkable features in standard physical systems. First, it is concave
and bounded from above and below. Secondly, the maximum of the curve is $\sigma_0 = \log 2$; the corresponding
value of $\omega$ is denoted by $\omega_0$. Thirdly, the curve lies below the $\sigma = \omega$ line, and is tangent to this line at
$\omega_1 (< \omega_0)$. The value $\sigma_1 = \sigma(\omega_1)$ is the entropy per spin of the target distribution,

$$\sigma_1 = \lim_{N \to \infty} -\frac{1}{N} \sum_s \hat p_s \log \hat p_s . \tag{3}$$


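More generally, given a real number k, we define $\omega_k$, $\sigma_k$, and $\ell_k$ through

$$\left. \frac{d\sigma}{d\omega} \right|_{\omega = \omega_k} = k , \qquad \sigma_k = \sigma(\omega_k) , \qquad \ell_k = k\,\omega_k - \sigma_k . \tag{4}$$

Note that the range of values of k such that $\omega_k$ is well-defined is generally bounded. These quantities
characterize the dominant contribution to the kth moment of $\hat p$:

$$\sum_s (\hat p_s)^k \doteq \int d\omega\; e^{N(\sigma(\omega) - k\omega)} \doteq e^{N(\sigma_k - k\omega_k)} = e^{-N\ell_k} . \tag{5}$$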
As an illustration of the properties above we show in Fig. 1 the entropy curve of the independent spin
model (ISM) defined as

$$\hat p_s^{\rm ISM} = \frac{1}{(2\cosh H)^N} \exp\Big\{ H \sum_{i=1}^N s_i \Big\} , \tag{6}$$

for H = 0.5. The value of H will not change throughout this paper.



Fig. 1 Entropy $\sigma$ as a function of $\omega$ for the independent spin model in eq. (6) with H = 0.5. The entropy curve is obtained
through a parametric representation $\omega(m) = \log(2\cosh H) - Hm$, $\sigma(m) = \log 2 - \frac{1}{2}(1+m)\log(1+m) - \frac{1}{2}(1-m)\log(1-m)$,
where m is the magnetization per spin, ranging from $-1$ to $+1$. It is easy to show that $\omega_k = \omega(\tanh(kH))$.
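
The parametric representation of the caption can be evaluated directly. The following minimal sketch (the function names `omega`, `sigma` and `critical_point` are ours, chosen for illustration) computes the triplets $(\omega_k, \sigma_k, \ell_k)$ of eq. (4) for the ISM from the relation $\omega_k = \omega(\tanh(kH))$, and can be used to check the tangency condition $\sigma_1 = \omega_1$, i.e. $\ell_1 = 0$.

```python
import math

H = 0.5  # field of the independent spin model, as in the text


def omega(m, h=H):
    """Rate omega(m) = log(2 cosh h) - h*m at magnetization m."""
    return math.log(2.0 * math.cosh(h)) - h * m


def sigma(m):
    """Entropy sigma(m) at magnetization m, with the convention 0 log 0 = 0."""
    def xlogx(x):
        return x * math.log(x) if x > 0.0 else 0.0
    return math.log(2.0) - 0.5 * xlogx(1.0 + m) - 0.5 * xlogx(1.0 - m)


def critical_point(k, h=H):
    """Return (omega_k, sigma_k, ell_k), using omega_k = omega(tanh(k*h))."""
    m_k = math.tanh(k * h)
    w_k, s_k = omega(m_k, h), sigma(m_k)
    return w_k, s_k, k * w_k - s_k


for k in (0, 1, 2):
    w_k, s_k, l_k = critical_point(k)
    print(f"k={k}: omega_k={w_k:.4f}  sigma_k={s_k:.4f}  ell_k={l_k:.4f}")
```

For k = 0 the slope of the curve vanishes, so the point returned is the maximum of the entropy curve; for k = 1 the curve touches the line $\sigma = \omega$, which expresses the normalization of $\hat p$.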


2.2 Observables, version space, and maximum entropy distribution 

Let v be an observable, taking value $v_s$ when the system configuration is s. Observables and probability
distributions can be seen as vectors in a $2^N$-dimensional space, with components labelled by the configurations
s. We assume that measurements give us access to the average value of the observable over the
target distribution,

$$\sum_s v_s\, \hat p_s \equiv v \cdot \hat p , \tag{7}$$

which may be simply written as the scalar product of the vectors attached to the observable and to the
target distribution. Suppose we have made M measurements corresponding to M observables $v^\mu$ ($\mu = 1, \ldots, M$).
An ‘admissible’ probability vector, p, compatible with all the measurements is such that

$$(p - \hat p) \cdot v^\mu = 0 , \qquad \forall\, \mu = 1, \ldots, M . \tag{8}$$

We hereafter use the term ‘constraints’ to refer to eq. (8) or to the attached observables $\{v^\mu\}_{\mu=1}^M$. The set of
vectors satisfying eq. (8), together with the normalization and the non-negativity conditions

$$\sum_s p_s = 1 \quad \text{and} \quad p_s \ge 0 , \ \forall s , \tag{9}$$


defines the version space. 

Each distribution p in the version space is characterized by its Shannon entropy,

$$S(p) = -\sum_s p_s \log p_s , \tag{10}$$
which may be interpreted as an estimate of the logarithm of the effective number of configurations under
the distribution. The maximum entropy (ME) distribution, $p^{\rm ME}$, is the distribution maximizing eq. (10)
in the version space, i.e. under the constraints listed in eq. (8) and in eq. (9). Using Lagrange multipliers
$\eta^\mu$ to enforce those constraints the ME distribution may be formally written as

$$p_s^{\rm ME} = \frac{1}{C} \exp\Big\{ \sum_{\mu=1}^M \eta^\mu v_s^\mu \Big\} , \tag{11}$$


where C is a normalization constant. As an illustration, if we consider the set of the N single-spin
observables, $v_s = s_i$, $1 \le i \le N$, and of the $N(N-1)/2$ two-spin observables $v_s = s_i s_j$, $1 \le i < j \le N$,
we recover the well-known result that the Ising model is the ME model given the average values of those
observables. The Lagrange multipliers $\eta$ coincide with, respectively, the N fields and the $N(N-1)/2$
pairwise couplings acting on the spins.
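
The simplest instance of this statement can be checked by hand. The sketch below is a toy illustration of ours, not part of the calculation of this paper: it takes N = 2 spins and constrains only the two magnetizations $m_1$, $m_2$. The admissible distributions then form a one-parameter family indexed by the correlation $c = \langle s_1 s_2 \rangle$, and a brute-force scan over c (function names are hypothetical) shows that the entropy is maximal at $c = m_1 m_2$, i.e. for the product distribution with fields only and no coupling.

```python
import math


def two_spin_distribution(m1, m2, c):
    """Probabilities p(s1, s2) with <s1> = m1, <s2> = m2 and <s1 s2> = c."""
    return {(s1, s2): (1.0 + m1 * s1 + m2 * s2 + c * s1 * s2) / 4.0
            for s1 in (-1, 1) for s2 in (-1, 1)}


def shannon_entropy(p):
    """S(p) = -sum_s p_s log p_s, with the convention 0 log 0 = 0."""
    return -sum(x * math.log(x) for x in p.values() if x > 0.0)


def max_entropy_correlation(m1, m2, steps=20001):
    """Scan the free parameter c and return (c, S) at the entropy maximum."""
    best_c, best_s = None, -math.inf
    for i in range(steps):
        c = -1.0 + 2.0 * i / (steps - 1)
        p = two_spin_distribution(m1, m2, c)
        if min(p.values()) < 0.0:   # outside the simplex: not admissible
            continue
        s = shannon_entropy(p)
        if s > best_s:
            best_c, best_s = c, s
    return best_c, best_s


m1, m2 = 0.3, -0.5
c_star, s_star = max_entropy_correlation(m1, m2)
print("entropy-maximizing correlation:", c_star, " product value:", m1 * m2)
```

A grid scan is used here only because the family is one-dimensional; for larger systems the Lagrange multipliers of eq. (11) would have to be determined by solving the constraint equations.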


2.3 Measures over the space of distributions 

In realistic situations involving measurement noise, it may be beneficial not to reproduce the given
average values perfectly, in order to avoid overfitting. To allow for such flexibility, we introduce a
Gaussian-like measure with variance E over the vector space of p:

$$\rho\big(p \,\big|\, \{v^\mu\}_{\mu=1}^M, \hat p\big) = \frac{1}{V} \prod_s \theta(p_s)\; \delta\Big(\sum_s p_s - 1\Big) \exp\Big\{ -\frac{1}{2E} \sum_{\mu=1}^M \big(v^\mu \cdot (p - \hat p)\big)^2 \Big\} , \tag{12}$$

where $\theta(x)$ is the step function, equal to 1 if x > 0 and to 0 if x < 0, and $\delta(x)$ is the Dirac delta function.
We will call E the tolerance hereafter. The denominator V is defined to make sure that the measure $\rho$ is
normalized:

$$V\big(E, \{v^\mu\}_{\mu=1}^M, \hat p\big) = \int_0^\infty \prod_s dp_s\; \delta\Big(\sum_s p_s - 1\Big) \exp\Big\{ -\frac{1}{2E} \sum_{\mu=1}^M \big(v^\mu \cdot (p - \hat p)\big)^2 \Big\} . \tag{13}$$

This normalization factor measures the volume of ‘admissible’ probability vectors given the constraints
eq. (8). In this probabilistic setting we will loosely use the term ‘version space’ to refer to the set of
distributions p associated to ‘large’ measure values $\rho(p)$. Note that, while $\rho(p)$ defines the joint measure
of the $2^N$ configuration probabilities, we will also consider below the marginal measure for a single-configuration
probability,

$$\rho_s\big(p_s \,\big|\, \{v^\mu\}_{\mu=1}^M, \hat p\big) = \int_0^\infty \prod_{t (\ne s)} dp_t\; \rho\big(p \,\big|\, \{v^\mu\}_{\mu=1}^M, \hat p\big) . \tag{14}$$

It is possible to consider other measures over the space of p for the purpose of studying the performance
of the ME distribution. To favor the probability vectors p with large Shannon entropies S(p), see eq.
(10), it is natural to introduce the new measure

$$\rho\big(p \,\big|\, \Gamma, \{v^\mu\}_{\mu=1}^M, \hat p\big) \propto \rho\big(p \,\big|\, \{v^\mu\}_{\mu=1}^M, \hat p\big)\; e^{\Gamma S(p)} , \tag{15}$$

where $\Gamma$ is the strength of this entropic bias. The normalization of this new measure $\rho$ implicitly defines
the following expression for the volume in the presence of an entropic bias,

$$V\big(\Gamma, E, \{v^\mu\}_{\mu=1}^M, \hat p\big) = \int_0^\infty \prod_s dp_s\; \delta\Big(\sum_s p_s - 1\Big) \exp\Big\{ -\frac{1}{2E} \sum_{\mu=1}^M \big(v^\mu \cdot (p - \hat p)\big)^2 + \Gamma S(p) \Big\} . \tag{16}$$

For $\Gamma = 0$ we recover the ‘unbiased’ measure in eq. (12), while the limit $\Gamma \to \infty$ retains the ME
distribution only.
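
To make the roles of the tolerance E and of the bias $\Gamma$ concrete, the exponent of the biased measure can be evaluated for any candidate distribution. The sketch below is an illustration under our own assumptions: the helper `log_weight` is a hypothetical name, the target `p_hat` is an arbitrary random point of the simplex, and the hard normalization constraint is replaced by a small numerical threshold. It returns the unnormalized logarithm of the measure, i.e. the misfit term divided by $-2E$ plus $\Gamma S(p)$.

```python
import math
import random


def shannon_entropy(p):
    """S(p) = -sum_s p_s log p_s, with the convention 0 log 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def log_weight(p, p_hat, observables, tolerance, gamma):
    """Unnormalized log of the biased measure at a point p.

    Evaluates -(1/2E) * sum_mu (v^mu . (p - p_hat))^2 + Gamma * S(p) and
    returns -inf if p is not a normalized, non-negative vector.
    """
    if min(p) < 0.0 or abs(sum(p) - 1.0) > 1e-9:
        return -math.inf
    misfit = 0.0
    for v in observables:
        dot = sum(v_s * (p_s - q_s) for v_s, p_s, q_s in zip(v, p, p_hat))
        misfit += dot * dot
    return -misfit / (2.0 * tolerance) + gamma * shannon_entropy(p)


# Toy instance: N = 3 spins, hence 2^N = 8 configurations, and M = 2 observables.
rng = random.Random(1)
dim = 8
raw = [rng.random() for _ in range(dim)]
p_hat = [x / sum(raw) for x in raw]       # hypothetical target distribution
observables = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(2)]
uniform = [1.0 / dim] * dim

for gamma in (0.0, 100.0):
    print(gamma,
          log_weight(p_hat, p_hat, observables, 1e-3, gamma),
          log_weight(uniform, p_hat, observables, 1e-3, gamma))
```

Without bias the target itself has the largest possible weight, since its misfit vanishes; as $\Gamma$ grows at fixed E the entropy term eventually dominates and favors the uniform distribution.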
2.4 Randomization of observables


Hereafter we assume that the M constraints $\{v^\mu\}_{\mu=1}^M$ are randomly and independently chosen from the
following Gaussian distribution over the space of $2^N$-dimensional vectors v:

$$P(v) = \prod_s \frac{e^{-v_s^2/2}}{\sqrt{2\pi}} . \tag{17}$$


As mentioned in the introduction we do not pretend that this assumption is realistic. Indeed, in practice,
observables are often chosen to be low-order moments of the target distribution such as magnetizations
or pairwise spin correlations, which is quite distinct from the Gaussian distribution above.

In contrast to the constraints $\{v^\mu\}_{\mu=1}^M$, which are randomly drawn, we do not choose any particular
statistical ensemble for the target distribution $\hat p$; the only property of $\hat p$ we need for the analysis is the
existence of the entropy curve $\sigma(\omega)$ discussed above. We stress that these constraints $\{v^\mu\}_{\mu=1}^M$, appearing in the
measure $\rho$ over the distribution space, are quenched random variables, drawn once and for all.

Since each constraint contributes a multiplicative factor $\exp\{-(v^\mu \cdot (p - \hat p))^2 / (2E)\}$ to the volume V in eq. (13),
we expect that the logarithm of this volume will be self-averaging in the large-M limit. We will therefore
calculate the average value of log V over the constraints, using the replica method.
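
The smallness of the scalar products mentioned in the introduction is easy to observe numerically. The sketch below assumes that the components of each observable are independent standard normal variables (zero mean, unit variance); the helper names are ours. For such observables $v \cdot p$ is a Gaussian variable whose standard deviation is the Euclidean norm of p, which equals $2^{-N/2}$ for the uniform distribution.

```python
import math
import random


def gaussian_observable(dim, rng):
    """One observable with dim independent standard normal components (assumed)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def rms_scalar_product(p, num_observables, rng):
    """Root-mean-square of v . p over independently drawn observables v."""
    total = 0.0
    for _ in range(num_observables):
        v = gaussian_observable(len(p), rng)
        total += sum(v_s * p_s for v_s, p_s in zip(v, p)) ** 2
    return math.sqrt(total / num_observables)


rng = random.Random(0)
N = 8
uniform = [2.0 ** -N] * (2 ** N)
measured = rms_scalar_product(uniform, 2000, rng)
expected = 2.0 ** (-N / 2)   # Euclidean norm of the uniform distribution
print(f"measured rms = {measured:.5f}, expected 2^(-N/2) = {expected:.5f}")
```

The uniform distribution is used only because its norm is known in closed form; any other point of the simplex can be passed to `rms_scalar_product` in the same way.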


3 Overview of results 

We hereafter report the main outcomes of our replica calculation in an informal way. All technical details
are postponed to sec. 4.


3.1 Order parameters: interpretation and scaling with N 


Given the set of constraints $\{v^\mu\}_{\mu=1}^M$ and the target distribution $\hat p$ we may draw a schematic picture of
the version space such as the one shown in Fig. 2. In addition to the target distribution, two distributions
of interest in the version space are the ME distribution, $p^{\rm ME}$, and the center-of-mass distribution, $\langle p \rangle$,
where the angular brackets denote the average over the measure $\rho$:

$$\langle p \rangle = \int dp\; \rho(p)\; p . \tag{18}$$

The following quantities are useful to characterize the ‘distances’ between the distributions in the version
space, see Fig. 2:

$$R\big(\{v^\mu\}_{\mu=1}^M, \hat p\big) = \sum_s \big( \langle p_s \rangle - \hat p_s \big)^2 , \qquad D\big(\{v^\mu\}_{\mu=1}^M, \hat p\big) = \sum_s \big( \langle p_s^2 \rangle - \langle p_s \rangle^2 \big) . \tag{19}$$

R measures the squared $L_2$ distance between the target and the center-of-mass distributions. D gives the
size of the squared fluctuations of p around the center of mass. We will also consider hereafter

$$Q\big(\{v^\mu\}_{\mu=1}^M, \hat p\big) = R + D = \sum_s \big\langle (p_s - \hat p_s)^2 \big\rangle , \tag{20}$$

which measures the averaged squared distance between p and $\hat p$ (Fig. 2).

D, Q, R are intimately related to the statistics of the generalization error, that is, the error on the
prediction of the average value of a new observable. Assume we have a measure $\rho$ over the version space
defined from a set of constraints $\{v^\mu\}_{\mu=1}^M$. Let us now consider a new observable, $v'$. The error on the
average value of this observable $v'$ computed with a distribution p is simply given by

$$\Delta(p) = v' \cdot (p - \hat p) . \tag{21}$$

Suppose now that p is randomly chosen according to the measure $\rho$ and that the components $v'_s$ are independent
normal random variables, with zero means and unit variances. Let us denote the average
over $v'$ by $[\cdot]_{v'}$. It is easy to show that

$$Q = \big[ \langle \Delta^2 \rangle \big]_{v'} , \qquad R = \big[ \langle \Delta \rangle^2 \big]_{v'} , \qquad D = \big[ \langle \Delta^2 \rangle - \langle \Delta \rangle^2 \big]_{v'} . \tag{22}$$
Fig. 2 Schematic view of the space of distributions. The large circle represents the version space of all admissible probability
vectors, see eq. (8), which depends on the constraints $\{v^\mu\}_{\mu=1}^M$. The shaded area represents the typical fluctuations of p
around the center-of-mass distribution $\langle p \rangle$, of magnitude $\sqrt{D}$. The distance between the target distribution $\hat p$ and $\langle p \rangle$ is
$\sqrt{R}$, while $\sqrt{Q}$ is the square root of the averaged squared distance between p and $\hat p$. These three order parameters measure
the lengths of the sides of the right triangle shown in the figure, see eq. (20). The ME distribution, $p^{\rm ME}$, lies inside
the version space.


Hence, Q represents the averaged squared error on a new observable, R quantifies the observable-to-observable
squared fluctuations of the error, and D measures the distribution-to-distribution squared
fluctuations of the error. The squared error on the prediction of the average value of the new observable,
Q, is the sum of two contributions. The first one, R, is due to the choice of observable. The second one,
D, reflects the distribution-to-distribution fluctuations with the version space measure.
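
The decomposition of Q into R and D holds for averages over any set of sampled distributions, which gives a simple consistency check for numerical estimates. In the sketch below the samples are drawn uniformly from the simplex (the unconstrained case M = 0, used here only as a stand-in for the measure $\rho$), and the target `p_hat` is an arbitrary point chosen for illustration; function names are ours.

```python
import random


def sample_simplex(dim, rng):
    """Uniform sample from the simplex via normalized exponential variables."""
    x = [rng.expovariate(1.0) for _ in range(dim)]
    total = sum(x)
    return [xi / total for xi in x]


def order_parameters(samples, p_hat):
    """Estimate (Q, R, D) from a list of sampled distributions p."""
    n, dim = len(samples), len(p_hat)
    mean = [sum(p[s] for p in samples) / n for s in range(dim)]
    mean_sq = [sum(p[s] ** 2 for p in samples) / n for s in range(dim)]
    r = sum((mean[s] - p_hat[s]) ** 2 for s in range(dim))
    d = sum(mean_sq[s] - mean[s] ** 2 for s in range(dim))
    q = sum(sum((p[s] - p_hat[s]) ** 2 for p in samples) / n
            for s in range(dim))
    return q, r, d


rng = random.Random(0)
samples = [sample_simplex(8, rng) for _ in range(500)]
p_hat = sample_simplex(8, rng)   # hypothetical target, for illustration only
q, r, d = order_parameters(samples, p_hat)
print(f"Q = {q:.6f}, R + D = {r + d:.6f}")
```

The equality is exact for empirical averages, up to rounding, so a violation in a simulation signals a bookkeeping error rather than statistical noise.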

To understand the scaling of D, Q, R with N let us consider the simple case of no constraint at all,
M = 0, and no bias on the entropy, $\Gamma = 0$. In this case, due to the permutation symmetry over the $2^N$
configurations, all the configuration probabilities $p_s$ obey the same marginal distribution, $\rho_s(p_s)$, see eq.
(14). It is easy to convince oneself that $\rho_s$ becomes a pure exponential when $N \gg 1$, with average value
$\langle p_s \rangle = 2^{-N}$. Indeed, the volume in eq. (13) is given by

$$V = \int_0^\infty \prod_s dp_s\; \delta\Big(\sum_s p_s - 1\Big) = \int_{-i\infty}^{i\infty} d\Lambda\; e^{\Lambda} \prod_s \Big( \int_0^\infty dp_s\, e^{-\Lambda p_s} \Big) = \int_{-i\infty}^{i\infty} d\Lambda\; \frac{e^{\Lambda}}{\Lambda^{2^N}} , \tag{23}$$

where we have used an integral representation of the Dirac delta function. This can be directly integrated,
but we here use the saddle-point method, valid for large N, for later convenience. The saddle point is
located at

$$\Lambda = 2^N , \tag{24}$$

giving $\log V = 2^N (1 - N \log 2)$, which agrees with the leading behaviour of the volume of the $2^N$-dimensional
simplex, $V = 1/[(2^N)!]$. In addition we see from eq. (14) and eq. (23) that $\rho_s$ is an exponential
distribution with mean value $1/\Lambda = 2^{-N}$, as announced above. Hence,

$$D \doteq e^{Nd} , \qquad Q \doteq e^{Nq} , \qquad R \doteq e^{Nr} , \qquad \Lambda \doteq e^{N\lambda} , \tag{25}$$

with $q = r = -\ell_2$, see eq. (5), and $d = -\lambda = -\log 2$. Our calculation with the replica method shows that,
in the presence of constraints, i.e. for $M \ge 1$, the exponential-in-N scaling of D, Q, R, $\Lambda$ will still hold
after averaging over the constraints, even if a bias $\Gamma$ over the entropy is imposed. Note that the rates
d, q, r, $\lambda$ will then depend on M and $\Gamma$.
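
The quality of the saddle-point estimate of the volume can be checked against the exact result. For $K = 2^N$ variables the integral over the normalized simplex equals $1/(K-1)!$, which differs from $1/K!$ only at subleading order. The minimal sketch below (function names are ours) compares the exact logarithm, computed with the log-gamma function, to the estimate $2^N (1 - N \log 2)$.

```python
import math


def log_volume_exact(n_spins):
    """log of 1/(K-1)!, the volume of the normalized simplex with K = 2^N."""
    k = 2 ** n_spins
    return -math.lgamma(k)


def log_volume_saddle(n_spins):
    """Saddle-point estimate 2^N (1 - N log 2), obtained at Lambda = 2^N."""
    k = 2 ** n_spins
    return k * (1.0 - n_spins * math.log(2.0))


for n in (6, 10, 16):
    exact, saddle = log_volume_exact(n), log_volume_saddle(n)
    rel = abs(saddle - exact) / abs(exact)
    print(f"N={n}: exact={exact:.2f}  saddle={saddle:.2f}  rel. error={rel:.2e}")
```

The relative error decreases quickly with N because the correction to the saddle point is only logarithmic in $2^N$.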
3.2 Description of the learning process

We first focus on the effect of increasing the number of measured observables, M, and set the entropic
bias $\Gamma$ to zero. The effect of non-zero biases will be reported in sec. 3.3. We assume in sec. 3.2.1 and sec.
3.2.2 that the tolerance E is negligible; the dependence of the results on the value of E is exposed in sec.
3.2.3.

3.2.1 The learning edge 

One major result of the replica calculation is that the marginal measure over the single-configuration
probabilities, $\rho_s$, eq. (14), may have two distinct behaviours, sketched in Fig. 3.

- The marginal measure over the probability $p_s$ of a configuration s having a large value $\hat p_s$ in the target
distribution is concentrated at a value close to this target probability:

$$\rho_s(p_s) \simeq A\, e^{-B (p_s - \hat p_s - \delta_s)^2 / 2} , \tag{26}$$

where the values of A and B depend on the parameters M, E, and will be specified later. The corresponding
curve of $\rho_s$ is shown in Fig. 3, left. The shift $\delta_s$ is small compared to the target probability
$\hat p_s$, as we will see in sec. 4. In other words, the configuration probability $p_s$ is correctly ‘learned’ from
the values of the constraints.

- The marginal measure over the probability $p_s$ of a configuration s having a small value $\hat p_s$ in the
target distribution is a decaying exponential:

$$\rho_s(p_s) \simeq \frac{1}{A'}\, e^{-p_s / A'} . \tag{27}$$

The corresponding curve of $\rho_s$ is shown in Fig. 3, right. The average value $A'$ of the configuration
probability $p_s$ does not depend on $\hat p_s$ in the dominant scaling with N. $A'$ is a function of the parameters
M, E only, and will be specified later. In other words, the configuration probability is not inferred at all.


Fig. 3 Schematic pictures of the marginal measure $\rho_s(p_s)$ for large (left) and small (right) configuration probabilities $\hat p_s$.


The concepts of large and small target probabilities are defined as follows:

$$\hat p_s = e^{-N\omega_s} \ \text{is large if}\ \omega_s < \hat\omega , \ \text{and is small if}\ \omega_s > \hat\omega . \tag{28}$$

The boundary $\hat\omega$ is hereafter referred to as the learning edge, as it separates the set of probabilities $\hat p_s$
into the ones which can be correctly learned and the ones which remain essentially unknown, despite the
knowledge of the observable values. Its value depends on the parameters M and E. The representative
curve of the learning edge $\hat\omega$ is an increasing function of the rate

$$\alpha = \frac{\log M}{N} , \tag{29}$$

and is shown in Fig. 4 for the ISM and a negligible tolerance E. We find qualitative changes of the curve
$\hat\omega(\alpha)$, taking place at critical values of the ratio $\alpha$ and corresponding to the onset of phase transitions in
the learning process.


Fig. 4 Learning edge $\hat\omega$ as a function of $\alpha$ for the ISM, eq. (6), with H = 0.5 and for small E.


3.2.2 Phase transitions and typical distances between distributions 

Given the value of the learning edge $\hat\omega$ we define, according to eq. (28), the sets of configurations s
having large and small probabilities as, respectively, L and S. Our replica calculation shows that the
order parameters Q and R are given by, to the dominant order in N,

$$Q \doteq R \doteq \sum_{s \in S} \hat p_s^{\,2} . \tag{30}$$

If M is small, so is the learning edge $\hat\omega$, and most configurations are found in S. This implies
$Q \doteq R \doteq \sum_s \hat p_s^{\,2} \doteq e^{-N\ell_2}$. As M increases, the learning edge grows, and reaches $\hat\omega = \omega_2$. For larger values of M,
the dominant point $\omega = \omega_2$ no longer belongs to S, and hence the rates q and r depart from $-\ell_2$. This leads
to non-analyticities in Q and R, and the transition point is at $\hat\omega = \omega_2$. Similarly, a switch of the dominant
term in the normalization condition $1 = \sum_s \langle p_s \rangle$ from the set S to the set L occurs at $\hat\omega = \omega_1$. The
Lagrange multiplier $\Lambda$ is non-analytic at this point, which defines a second phase transition.

Hence, we have three distinct phases, separated by specific values of the learning edge. We label the
phases by ‘I’ when $\hat\omega < \omega_2$, ‘II’ when $\omega_2 < \hat\omega < \omega_1$, and ‘III’ when $\omega_1 < \hat\omega$. The value of the ratio $\alpha$
corresponding to the transition point $\hat\omega = \omega_2$ is denoted as $\alpha_2$, and the one corresponding to $\hat\omega = \omega_1$ is
called $\alpha_1$. In phases I and II with small E, $\hat\omega$ turns out to be equal to $\alpha$. Thus $\alpha_2 = \omega_2$ and $\alpha_1 = \omega_1$.

To get insights about the typical distances between the distributions of interest in the version space
we plot the order parameters q, r and d of the ISM in Fig. 5. Those order parameters change continuously
at the transition points. At $\alpha = \log 2$, the learning edge becomes equal to $\omega_0$; q and r reach the same
value as d. The agreement between r and d, the distance of the center of mass to the target distribution
and the typical fluctuations, implies that the volume largely shrinks and scales with N in a different
manner at this point. Our calculation, restricted to the leading order in the volume, is not informative
for larger ratios, i.e. $\alpha > \log 2$.
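
The dependence of the rates q and r on the learning edge can be illustrated for the ISM. The sketch below rests on an assumption of ours: that the sum over small-probability configurations which gives Q and R can be evaluated with the same saddle-point argument as the moments of $\hat p$, restricted to rates $\omega$ above the learning edge. Under this assumption $q = r = \max_{\omega \ge \hat\omega} [\sigma(\omega) - 2\omega]$, which equals $-\ell_2$ as long as $\hat\omega < \omega_2$ and $\sigma(\hat\omega) - 2\hat\omega$ beyond. Function names are hypothetical.

```python
import math

H = 0.5                                    # ISM field used in the text
LOG_2COSH = math.log(2.0 * math.cosh(H))   # value of omega at zero magnetization


def omega_k(k):
    """Rate omega_k = omega(tanh(k*H)) of the ISM."""
    return LOG_2COSH - H * math.tanh(k * H)


def sigma_of_omega(w):
    """Entropy curve sigma(omega) of the ISM, using m = (omega_0 - omega) / H."""
    m = (LOG_2COSH - w) / H
    if abs(m) > 1.0:
        raise ValueError("omega outside the support of the entropy curve")

    def xlogx(x):
        return x * math.log(x) if x > 0.0 else 0.0

    return math.log(2.0) - 0.5 * xlogx(1.0 + m) - 0.5 * xlogx(1.0 - m)


def phase(w_edge):
    """Phase label as a function of the learning edge."""
    if w_edge < omega_k(2):
        return "I"
    if w_edge < omega_k(1):
        return "II"
    return "III"


def rate_q(w_edge):
    """Assumed rate q = r: maximum of sigma(omega) - 2*omega over omega >= w_edge."""
    w_star = max(w_edge, omega_k(2))   # the unconstrained maximum sits at omega_2
    return sigma_of_omega(w_star) - 2.0 * w_star


for w in (0.35, 0.50, 0.70):
    print(f"learning edge {w:.2f}: phase {phase(w)}, q = r = {rate_q(w):.4f}")
```

The restriction of the maximum to the boundary relies on the concavity of the entropy curve, which makes $\sigma(\omega) - 2\omega$ decreasing beyond $\omega_2$.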

3.2.3 Effect of the tolerance E 

Let us now consider the case of a non-negligible tolerance E. We give the corresponding phase diagram
in Fig. 6. It is natural to scale the tolerance as $E = e^{N\epsilon}$. As $\epsilon$ grows from very negative values, we expect
that the quality of the inference becomes worse, as the version space includes more distributions, which
do not exactly reproduce the target values of the observables. Indeed, for large values of the tolerance E,
there emerge new phases where the learning edge decreases as $\epsilon$ grows. We call these phases with large
tolerance I$_{\rm lt}$, II$_{\rm lt}$ and III$_{\rm lt}$, in agreement with the denomination chosen above based on the different
ranges of possible values for the learning edge.

Fig. 5 Order parameters q, r, d as functions of the learning edge $\hat\omega$ for the ISM, eq. (6), with H = 0.5 and for small E. The
critical values of $\omega$ are $\omega_0 \approx 0.81$, $\omega_1 \approx 0.58$, and $\omega_2 \approx 0.43$, and the corresponding entropies are $\sigma_0 = \log 2 \approx 0.69$,
$\sigma_1 \approx 0.58$, and $\sigma_2 \approx 0.37$, respectively.

Fig. 6 Phase diagram in the $\alpha$-$\epsilon$ plane in the absence of any entropic bias.

The transition from small to large tolerances takes place at E = D. This result is easy to interpret:
deviations in the values of observables from the target ones smaller than the scale of the intrinsic fluctuations
D will be masked by those fluctuations, and cannot degrade the inference performance. The
results shown in sec. 3.2.1 and sec. 3.2.2 are therefore valid as long as $\epsilon < d$.

In the present study, we do not consider any measurement error: the measured values of the observables
correspond exactly to the projection of the observable vectors onto the target distribution. If
measurement errors were considered, the role of the tolerance parameter could possibly change; an appropriate
amount of tolerance E, of the order of the measurement error, would lead to better inference
by avoiding overfitting.


3.3 Effects of the entropic bias 

We now consider the effect of an entropic bias $\Gamma > 0$ on the learning performance. As $\Gamma$ grows, the
measure $\rho$ gives more and more weight to the distributions p with large entropies. In the $\Gamma \to \infty$ limit $\rho$
singles out the ME distribution. The $\Gamma$-dependence of the quantities of interest, in particular the learning
edge, is of key importance. Similarly to the other parameters we assume that the entropic bias scales as
$\Gamma = e^{N\gamma}$.

We first consider the case of negligible tolerance, $E \approx 0$. Our replica calculation shows that the order parameters
Q, R and the learning edge $\hat\omega$ do not depend at all on the value $\Gamma > 0$. An immediate consequence is that
the ME distribution is not closer to the target distribution than any distribution picked at random
(with the unbiased measure) in the version space. Nevertheless the presence of an entropic bias affects
the fluctuations of p, measured by the order parameter D. For $\alpha < \alpha_1$, our replica calculation shows that
$D = 2^{-N}$ for small $\Gamma$, while $D = \Gamma^{-1}$ for large $\Gamma$. The transition between these two scalings takes place
at $\gamma_c = \log 2$. The same transition for D is observed when $\alpha > \alpha_1$, though the transition point $\gamma_c$ now
depends on $\alpha$. The order parameter $\Lambda$ shows non-analyticities at those transition points, contrary to Q,
R, and the learning edge, which remain unaffected as mentioned above. The non-analyticities in D and
$\Lambda$ discriminate the phases for $\gamma < \gamma_c$ from the ones for $\gamma > \gamma_c$; we denote the phases with large entropic
bias by I$_{\rm ME}$, II$_{\rm ME}$, and III$_{\rm ME}$, in agreement with the denomination based on the ranges of values of the
learning edge. The corresponding phase diagram is shown in Fig. 7, left. On the right panel of the same
figure, the order parameter d is plotted against $\gamma$ for the ISM with H = 0.5 and $\alpha < \alpha_1$ as an illustration.


Fig. 7 Left: Phase diagram in the $\alpha$-$\gamma$ plane for very small tolerance E. The fluctuation order parameter $D = e^{Nd}$ depends
on $\gamma$ in the phases with large entropic biases, I$_{\rm ME}$, II$_{\rm ME}$, and III$_{\rm ME}$, while it keeps the same value as for $\Gamma = 0$ in I, II, and
III. Right: order parameter d as a function of $\gamma$ for the ISM and $\alpha < \alpha_1$. A non-analyticity appears at $\gamma = \gamma_c$ (= log 2),
which signals the onset of the phase with large entropic bias for $\gamma > \gamma_c$.


The phases above change when the value of the tolerance $E = e^{N\epsilon}$ is no longer negligible. As
in the $\Gamma = 0$ case, the tolerance matters if $E > D$. As the fluctuations scale as $D = \Gamma^{-1}$ for large $\Gamma$
with $\alpha < \alpha_1$, this implies that a new regime appears for $\gamma > -\epsilon$. This large-tolerance, large-entropic-bias
regime shows unusual properties. For instance, the learning edge becomes smaller and smaller, that is, the
learning performance becomes worse and worse, as $\Gamma$ grows, in contrast to the case of negligible $E \approx 0$,
where the learning edge has no dependence on $\Gamma$. This seemingly strange behaviour is actually expected.
We see from the definition of the measure that, for very large $\Gamma$ and finite fixed $E$, the constraints become irrelevant. Hence,
in this limit, the measure concentrates around the uniform distribution $p_s = 1/2^N$, which maximizes the
entropy irrespective of the constraints. To get a meaningful ME distribution, the tolerance $E$ must be
small enough compared to the fluctuations $D$ governed by $\Gamma$ in the large-$\Gamma$ limit. In conclusion, this large-tolerance,
large-entropic-bias regime is irrelevant and will not be studied further.

4 Analytical calculations

In this section, we present our replica calculation of the average value of the logarithm of the volume $V$
over the constraints. We introduce the notations

$$\mathrm{Tr}(\cdots) = \int_0^\infty \prod_s dp_s\, \delta\Big(\sum_s p_s - 1\Big)(\cdots), \tag{31}$$

and $Dz = dz\, e^{-z^2/2}/\sqrt{2\pi}$. We also write integrals without explicit bounds as a shorthand notation for the domains of
integration $[-\infty, \infty]$ or $[-i\infty, i\infty]$.


4.1 Replica calculation of the average logarithm of the volume 

We start by unraveling the squared terms in the exponent of the volume $V$ through auxiliary Gaussian
integrations, the so-called Hubbard-Stratonovich transformations:

$$V = \int \prod_{\mu=1}^{M} Dz_\mu\, \mathrm{Tr} \exp\Big\{ \frac{i}{\sqrt{E}} \sum_\mu \sum_s z_\mu v^\mu_s (p_s - \hat p_s) + \Gamma S(\mathbf{p}) \Big\}. \tag{32}$$


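According to the replica method we calculate the $n$th moment of the volume for $n \in \mathbb{N}$,

$$[V^n] = \Big[ \prod_{a=1}^{n} \Big( \int \prod_\mu Dz^a_\mu\, \mathrm{Tr}_{p^a} \Big) \exp\Big\{ \frac{i}{\sqrt{E}} \sum_a \sum_\mu \sum_s z^a_\mu v^\mu_s (p^a_s - \hat p_s) + \Gamma \sum_a S(\mathbf{p}^a) \Big\} \Big]$$
$$= \prod_{a=1}^{n} \Big( \int \prod_\mu Dz^a_\mu\, \mathrm{Tr}_{p^a} \Big) \exp\Big\{ -\frac{1}{2E} \sum_\mu \sum_{a,b} z^a_\mu z^b_\mu Q_{ab} + \Gamma \sum_a S(\mathbf{p}^a) \Big\}$$
$$= \prod_{a=1}^{n} \mathrm{Tr}_{p^a} \exp\Big\{ -\frac{M}{2} \log\det\Big(1 + \frac{Q}{E}\Big) + \Gamma \sum_a S(\mathbf{p}^a) \Big\}, \tag{33}$$

where the square brackets denote the average over the observables as in sec. 3.1 and we define

$$Q_{ab} = \sum_s (p^a_s - \hat p_s)(p^b_s - \hat p_s). \tag{34}$$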

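To perform the integrations $\mathrm{Tr}_{p^a}$, we make use of the following identity

$$1 = \int \prod_{a\le b} dQ_{ab} \prod_{a\le b} \delta\Big( Q_{ab} - \sum_s (p^a_s - \hat p_s)(p^b_s - \hat p_s) \Big)$$
$$= C \int \prod_{a\le b} dQ_{ab}\, dQ'_{ab} \exp\Big\{ \frac12 \sum_{a\le b} Q_{ab} Q'_{ab} - \frac12 \sum_{a\le b} Q'_{ab} \sum_s (p^a_s - \hat p_s)(p^b_s - \hat p_s) \Big\}, \tag{35}$$

where the normalization constant $C$ is irrelevant and will be discarded hereafter. Similarly, we rewrite
$\mathrm{Tr}_{p^a}$ as

$$\mathrm{Tr}_{p^a} = \int d\Lambda_a \int_0^\infty \prod_s dp^a_s\, e^{-\Lambda_a \sum_s (p^a_s - \hat p_s)}, \tag{36}$$

where we used the normalization identity $1 = \sum_s \hat p_s$. We then obtain

$$[V^n] = \int \prod_{a\le b} dQ_{ab}\, dQ'_{ab} \int \prod_a d\Lambda_a \int_0^\infty \prod_{a,s} dp^a_s$$
$$\times \exp\Big\{ -\sum_a \Lambda_a \sum_s (p^a_s - \hat p_s) - \frac12 \sum_{a\le b} Q'_{ab} \sum_s (p^a_s - \hat p_s)(p^b_s - \hat p_s) + \Gamma \sum_a S(\mathbf{p}^a) \Big\}$$
$$\times \exp\Big\{ \frac12 \sum_{a\le b} Q_{ab} Q'_{ab} - \frac{M}{2} \log\det\Big(1 + \frac{Q}{E}\Big) \Big\}. \tag{37}$$

The logarithm of $[V^n]$ can be approximated by the saddle-point value of

$$\phi(n) = \frac12 \sum_{a\le b} Q_{ab} Q'_{ab} - \frac{M}{2} \log\det\Big(1 + \frac{Q}{E}\Big) + \sum_s \log \Theta_s, \tag{38}$$

over the matrices $Q$ and $Q'$, and where

$$\Theta_s = \int_0^\infty \prod_a dp^a_s \exp\Big\{ -\sum_a \Lambda_a (p^a_s - \hat p_s) - \frac12 \sum_{a\le b} Q'_{ab} (p^a_s - \hat p_s)(p^b_s - \hat p_s) - \Gamma \sum_a p^a_s \log p^a_s \Big\}. \tag{39}$$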

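We now assume that the order parameter matrices are invariant under permutations of the
replica indices, and write

$$Q_{ab} = R + (Q - R)\,\delta_{ab}, \quad Q'_{ab} = -2R' + (Q' + R')\,\delta_{ab}, \quad \Lambda_a = \Lambda, \tag{40}$$

where $\delta_{ab}$ is the Kronecker delta. Thus

$$\frac12 \sum_{a\le b} Q_{ab} Q'_{ab} = \frac{n}{2} Q(Q' - R') - \frac12 n(n-1) R R', \tag{41}$$

$$\log\det\Big(1 + \frac{Q}{E}\Big) = \log\Big(1 + \frac{Q + (n-1)R}{E}\Big) + (n-1)\log\Big(1 + \frac{Q-R}{E}\Big), \tag{42}$$

and,

$$\frac12 \sum_{a\le b} Q'_{ab} (p^a_s - \hat p_s)(p^b_s - \hat p_s) = \frac12 Q' \sum_a (p^a_s - \hat p_s)^2 - \frac12 R' \Big( \sum_a (p^a_s - \hat p_s) \Big)^2. \tag{43}$$

Using the Hubbard-Stratonovich transformation again, eq. (39) becomes

$$\Theta_s = \int Dz \int_0^\infty \prod_a dp^a_s\, \exp\Big\{ -\sum_a \Big[ (\Lambda - z\sqrt{R'})(p^a_s - \hat p_s) + \frac12 Q' (p^a_s - \hat p_s)^2 + \Gamma p^a_s \log p^a_s \Big] \Big\} = \int Dz\, X_s^n. \tag{44}$$

Hence,

$$\phi(n) = \frac{n}{2} Q(Q' - R') - \frac12 n(n-1) R R' - \frac{M}{2} \Big\{ \log\Big(1 + \frac{Q + (n-1)R}{E}\Big) + (n-1)\log\Big(1 + \frac{Q-R}{E}\Big) \Big\} + \sum_s \log \int Dz\, X_s^n. \tag{45}$$

We finally obtain the expression for $[\log V] = \lim_{n\to 0} \phi(n)/n$:

$$[\log V] = \frac12 Q(Q' - R') + \frac12 R R' - \frac{M}{2} \Big\{ \frac{R}{E + Q - R} + \log\Big(1 + \frac{Q-R}{E}\Big) \Big\} + \sum_s \int Dz \log X_s, \tag{46}$$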
where

$$X_s = \int_0^\infty dp\; e^{-(\Lambda - z\sqrt{R'})(p - \hat p_s) - \frac12 Q' (p - \hat p_s)^2 - \Gamma p \log p}. \tag{47}$$

The integration over $p$ in eq. (47) has to be done with care. We assume that the conjugated order
parameters, $Q'$ and $R'$, obey an exponential scaling with $N$,

$$Q' = e^{Nq'}, \quad R' = e^{Nr'}. \tag{48}$$

Therefore, all the parameters appearing in the integration diverge or vanish when $N \to \infty$. We need to
find out which order parameters diverge and which do not, consistently with the equations of state derived below.
This requires an involved case analysis. The outcome is that the equations of state are consistent,
to the leading order in $N$, if the following conditions are met:

$$0 < \lambda < q' < 2\lambda, \quad 0 < \lambda < r' < 2\lambda, \quad q = r < 0, \quad \epsilon < -\gamma < 0. \tag{49}$$

4.2 $\Gamma = 0$ case

In this case, the integral in eq. (47) can be evaluated in two different ways. The first one is more direct,
but the second is useful to clarify the physical significance of the solutions and to treat the finite-$\Gamma$
case.


4.2.1 Direct integration and expansion


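The first strategy to evaluate $X_s$ is to directly integrate eq. (47) for $\Gamma = 0$. The result is

$$X_s = \int_0^\infty dp\; e^{-(\Lambda - z\sqrt{R'})(p - \hat p_s) - \frac12 Q'(p - \hat p_s)^2} = \sqrt{\frac{2\pi}{Q'}}\; \exp\Big\{ \frac{(\Lambda - z\sqrt{R'})^2}{2Q'} \Big\}\; H\Big( y_s - z\sqrt{\frac{R'}{Q'}} \Big), \tag{50}$$

where we put $y_s = (\Lambda - \hat p_s Q')/\sqrt{Q'}$ and we define the complementary error function

$$H(y) = \int_y^\infty Dz. \tag{51}$$

Thus

$$[\log V] = \frac12 Q(Q' - R') + \frac12 R R' - \frac{M}{2} \Big\{ \frac{R}{E + Q - R} + \log\Big(1 + \frac{Q-R}{E}\Big) \Big\}
+ 2^N \Big( \frac{\Lambda^2 + R'}{2Q'} + \frac12 \log 2\pi - \frac12 \log Q' \Big) + \sum_s \int Dz \log H\Big( y_s - z\sqrt{\frac{R'}{Q'}} \Big). \tag{52}$$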
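The closed form in eq. (50) can be checked numerically. The sketch below is only an illustration: it uses order-one parameter values (an assumption made for readability, not the exponential scalings of eq. (48)), and the function names are hypothetical. It compares the closed form with a plain trapezoidal quadrature of the integral.

```python
import math


def gauss_tail(y):
    """H(y) = int_y^infinity Dz, expressed through the complementary error function."""
    return 0.5 * math.erfc(y / math.sqrt(2.0))


def x_closed_form(lam, q_conj, r_conj, p_hat, z):
    """Closed form of X_s at Gamma = 0, following eq. (50)."""
    a = lam - z * math.sqrt(r_conj)                 # Lambda - z sqrt(R')
    y = (a - p_hat * q_conj) / math.sqrt(q_conj)    # y_s - z sqrt(R'/Q')
    prefactor = math.sqrt(2.0 * math.pi / q_conj)
    return prefactor * math.exp(a * a / (2.0 * q_conj)) * gauss_tail(y)


def x_quadrature(lam, q_conj, r_conj, p_hat, z, p_max=20.0, steps=200_000):
    """Trapezoidal estimate of int_0^p_max dp exp(f_s(p)) at Gamma = 0."""
    a = lam - z * math.sqrt(r_conj)

    def integrand(p):
        u = p - p_hat
        return math.exp(-a * u - 0.5 * q_conj * u * u)

    h = p_max / steps
    total = 0.5 * (integrand(0.0) + integrand(p_max))
    for i in range(1, steps):
        total += integrand(i * h)
    return total * h


if __name__ == "__main__":
    for z in (0.7, 3.0):
        print(z, x_closed_form(1.5, 2.0, 0.5, 0.3, z), x_quadrature(1.5, 2.0, 0.5, 0.3, z))
```

The cutoff `p_max` and the number of steps are arbitrary choices; the integrand is negligible well before the cutoff for the values used here.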


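The definition of $y_s$ allows us to introduce the learning edge in a natural way. Under assumption (49),
$y_s = (\Lambda - \hat p_s Q')/\sqrt{Q'}$ diverges. If the target probability $\hat p_s = e^{-N\omega_s}$ is small enough, i.e. if $\omega_s > q' - \lambda$,
the dominant term in $y_s$ is $\Lambda/\sqrt{Q'}$, which goes to $+\infty$ in the thermodynamic limit. On the contrary
we find $y_s \to -\infty$ if $\omega_s < q' - \lambda$. This sharp difference defines the learning edge and the corresponding
entropy as

$$\hat\omega = q' - \lambda, \quad \hat\sigma = \sigma(\hat\omega). \tag{53}$$

Furthermore, based on assumption (49), $|y_s|$ increases faster with $N$ than $\sqrt{R'/Q'}$, entailing that the
argument of the complementary error function $H$ in eq. (50) is dominated by $y_s$. Thus, the complementary
error function can be replaced with its asymptotic behavior, depending on the sign of $y_s$,

$$H(y) \to \begin{cases} \dfrac{e^{-y^2/2}}{\sqrt{2\pi}\, y}, & (y \to \infty) \\ 1, & (y \to -\infty) \end{cases} \tag{54}$$

Using this, we get

$$\sum_s \int Dz \log H\Big( y_s - z\sqrt{\frac{R'}{Q'}} \Big) \approx \sum_{s\in L} 0 - \sum_{s\in S} \int Dz \Big\{ \frac12 \Big( y_s - z\sqrt{\frac{R'}{Q'}} \Big)^2 + \frac12 \log 2\pi + \log\Big( y_s - z\sqrt{\frac{R'}{Q'}} \Big) \Big\}. \tag{55}$$

The logarithmic term in the above formula can be expanded according to eq. (49) as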



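$$\log\Big( y_s - z\sqrt{\frac{R'}{Q'}} \Big) \approx \log\Lambda - \frac12 \log Q' - \frac{\hat p_s Q' + \sqrt{R'}\, z}{\Lambda} - \frac12 \frac{(\hat p_s Q' + \sqrt{R'}\, z)^2}{\Lambda^2}. \tag{56}$$

Linear terms with respect to $z$ vanish upon Gaussian integration. We are left with

$$[\log V] \approx \frac12 Q(Q' - R') + \frac12 R R' - \frac{M}{2} \Big\{ \frac{R}{E + Q - R} + \log\Big(1 + \frac{Q-R}{E}\Big) \Big\}
+ e^{N\hat\sigma} \Big( \frac{\Lambda^2 + R'}{2Q'} + \frac12 \log 2\pi - \frac12 \log Q' \Big) - 2^N \Big( \log\Lambda - \frac12 \frac{R'}{\Lambda^2} \Big)$$
$$+ \sum_{s\in S} \Big\{ \hat p_s \Big( \Lambda + \frac{Q'}{\Lambda} \Big) - \frac12 \hat p_s^2\, Q' \Big( 1 - \frac{Q'}{\Lambda^2} \Big) \Big\}, \tag{57}$$

where we have used

$$2^N - \sum_{s\in S} 1 = \sum_{s\in L} 1 = e^{N\hat\sigma}, \quad \sum_{s\in S} 1 \approx 2^N. \tag{58}$$

These formulas are correct as long as most configurations belong to the set $S$.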


4.2.2 Saddle-point approximation

Our second method to evaluate $X_s$ in eq. (50) is based on the saddle-point approximation. Writing the
exponent as $f_s(p) = -(\Lambda - z\sqrt{R'})(p - \hat p_s) - \frac12 Q'(p - \hat p_s)^2$ in eq. (50) and differentiating with respect
to $p$ we get the solution of the saddle-point equation $f'_s(p^*) = 0$:

$$p^* = \hat p_s - \frac{\Lambda - z\sqrt{R'}}{Q'}. \tag{59}$$

As we know from the previous calculation, $\log(\Lambda/Q')$ determines the learning edge. If $\hat p_s > \Lambda/Q'$ the
dominant contribution to $p^*$ is $\hat p_s$ itself and $p^*$ is positive, implying that the saddle point is in the feasibility
domain:

$$X_s \approx e^{f_s(p^*)} \int dx\; e^{-\frac12 |f''_s(p^*)| x^2} = e^{f_s(p^*)} \sqrt{\frac{2\pi}{|f''_s(p^*)|}} = \sqrt{\frac{2\pi}{Q'}}\; \exp\Big\{ \frac{(\Lambda - z\sqrt{R'})^2}{2Q'} \Big\}. \tag{60}$$

This result coincides with eq. (50) with $H(y) \to 1$, $y \to -\infty$. To check the validity of the saddle-point
approximation we can compare the peak location $p^* \approx \hat p_s$ to the width of the Gaussian,
$1/\sqrt{|f''_s(p^*)|} = 1/\sqrt{Q'}$. Under assumption (49) we see that $\omega_s < \hat\omega = q' - \lambda < q'/2$ and thus $p^*$ is much larger than the width,
to dominant exponential-in-$N$ order. The peak height $f_s(p^*)$ also diverges rapidly with $N$, and the
saddle-point approximation is justified.

In the case of configurations $s \in S$ with small probabilities, $\omega_s > \hat\omega$, the dominant term on the right
hand side of eq. (59) is $-\Lambda/Q'$; $p^*$ becomes negative and lies outside the domain of integration. Hence, the
true saddle-point value is $p^* = 0$. We expand $f_s(p)$ around $p = 0$ up to the first order, and approximate
the integral as

$$X_s \approx \int_0^\infty dp\; e^{f_s(0) + f'_s(0)\, p} = \frac{e^{(\Lambda - z\sqrt{R'})\hat p_s - \frac12 Q' \hat p_s^2}}{\Lambda - Q'\hat p_s - z\sqrt{R'}}. \tag{61}$$

This is again identical to eq. (50) in the limit $H(y) \approx e^{-y^2/2}/(\sqrt{2\pi}\, y)$ with $y = y_s - z\sqrt{R'/Q'}$.

In addition, the saddle-point calculations above give us the expressions for the marginal measure $\rho_s$
given in sec. 3.2. Indeed, the marginal measure of the probability of the configuration $s$ reads

$$\rho_s(p_s) = \frac{1}{X_s}\, e^{f_s(p_s)}. \tag{62}$$

Therefore $\rho_s$ is approximately an exponentially-decaying function (27) for small target probabilities, and
a Gaussian centered close to the corresponding target probability (26) for large target probabilities.

4.2.3 Equations of state and their solutions

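Here we derive the equations of state (EOSs). Taking the derivatives of eq. (57) with respect to the order
parameters, we get

$$R' = \frac{MR}{(E + Q - R)^2}, \quad Q' = \frac{M}{E + Q - R}, \tag{63}$$

$$Q = \Big[ 1 - \frac{2Q'}{\Lambda^2} \Big] \sum_{s\in S} \hat p_s^2 - \frac{2}{\Lambda} \sum_{s\in S} \hat p_s + e^{N\hat\sigma}\, \frac{\Lambda^2 + R' + Q'}{Q'^2}, \tag{64}$$

$$D = Q - R = \frac{2^N}{\Lambda^2} + \frac{e^{N\hat\sigma}}{Q'}, \tag{65}$$

$$e^{N\hat\sigma} \frac{\Lambda}{Q'} - \frac{2^N}{\Lambda} \Big( 1 + \frac{R'}{\Lambda^2} \Big) + \Big( 1 - \frac{Q'}{\Lambda^2} \Big) \sum_{s\in S} \hat p_s - \frac{Q'^2}{\Lambda^3} \sum_{s\in S} \hat p_s^2 = 0. \tag{66}$$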

The order parameter $D$ defined in sec. 3 naturally appears. According to assumption (49), we know that
all the terms with the factor $Q'/\Lambda^2$ are not dominant in eqs. (63)-(66); the same statement applies to the
terms with the factor $e^{N\hat\sigma}$ in eqs. (64)-(66). Matching the dominant contributions to the remaining terms
we obtain

$$Q = \sum_{s\in S} \hat p_s^2, \quad \frac{2^N}{\Lambda} = \sum_{s\in S} \hat p_s, \tag{67}$$

$$D = \frac{2^N}{\Lambda^2}. \tag{68}$$

The physical meaning of the first equation is clear. The order parameter $Q = \sum_s \langle (p_s - \hat p_s)^2 \rangle$ quantifies
the average squared distance of a distribution $\mathbf{p}$ (chosen with measure $\rho$) to the target distribution.
The contributions coming from the large probabilities are negligible because, as we have seen in sec.
4.2.2, the marginal measures for large probabilities are centered close to the target values, which makes
$\langle (p_s - \hat p_s)^2 \rangle$ very small. However, for the configurations with small target probabilities, $\langle p_s^2 \rangle$ is much
smaller than $\hat p_s^2$. Thus $Q$ is, to the leading order in $N$, equal to $\sum_{s\in S} \hat p_s^2$. We see, in addition, from eq. (67)
that $\Lambda$ ensures the normalization condition, and is equal to $2^N$ as long as most configurations are in the
set $S$. According to eq. (67), we find

$$q = r = \begin{cases} -\ell_2 & (\hat\omega < \omega_2) \\ \hat\sigma - 2\hat\omega & (\omega_2 < \hat\omega) \end{cases}, \quad
\lambda = \begin{cases} \log 2 & (\hat\omega < \omega_1) \\ \log 2 - \hat\sigma + \hat\omega & (\omega_1 < \hat\omega) \end{cases}, \quad
d = \log 2 - 2\lambda. \tag{69}$$

The other order parameters $q'$, $r'$ and $\hat\omega$ are computed accordingly.


Small-E case. We can ignore $E$ when it is smaller than $D = Q - R$. In this case we get the EOSs for the
parameters $q'$ and $r'$:

$$r' = \alpha + r - 2d, \quad q' = \alpha - d, \tag{70}$$

and the learning edge is self-consistently determined by eq. (53). Solving these equations, we obtain

Phase I ($\hat\omega < \omega_2$):

$$\lambda = \log 2, \quad \hat\omega = \alpha, \quad q' = \alpha + \log 2, \quad q = r = -\ell_2, \quad r' = \alpha + 2\log 2 - \ell_2, \quad d = -\log 2. \tag{71}$$

Phase II ($\omega_2 < \hat\omega < \omega_1$):

$$\lambda = \log 2, \quad \hat\omega = \alpha, \quad q' = \alpha + \log 2, \quad q = r = \sigma(\alpha) - 2\alpha, \quad r' = \sigma(\alpha) - \alpha + 2\log 2, \quad d = -\log 2. \tag{72}$$

Phase III ($\omega_1 < \hat\omega$):

$$\lambda = \log 2 + \hat\omega - \hat\sigma, \quad \hat\omega = \sigma^{-1}(\alpha), \quad q' = \log 2 + 2\hat\omega - \alpha, \quad q = r = \sigma(\hat\omega) - 2\hat\omega,$$
$$r' = 2(\log 2 + \hat\omega - \hat\sigma) = 2\lambda, \quad d = -\log 2 + 2\hat\sigma - 2\hat\omega. \tag{73}$$

The critical ratio separating phases I and II is given by $\alpha_2 = \omega_2$, while the critical ratio associated to
the transition from II to III is $\alpha_1 = \omega_1$. We can easily check that those solutions satisfy the assumptions
(49), and are therefore self-consistent.
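The small-E solutions of eqs. (71)-(73) can be evaluated for a concrete entropy curve. The sketch below is an illustration under a loudly stated assumption: it takes the ISM to be a product distribution with $\hat p_s \propto \exp(H \sum_i s_i)$, whose exact definition lies outside this section, so the curve $\sigma(\omega)$ used here is hypothetical. All function names are illustrative. $\omega_1$ and $\omega_2$ are taken as the points where the slope of $\sigma$ equals 1 and 2, and $\ell_2 = 2\omega_2 - \sigma(\omega_2)$.

```python
import math

LOG2 = math.log(2.0)


def ism_point(x, field):
    """(omega, sigma) for configurations with a fraction x of minus spins.

    Assumption: product distribution p_hat(s) = prod_i exp(field * s_i) / (2 cosh(field)).
    """
    omega = math.log(2.0 * math.cosh(field)) - field * (1.0 - 2.0 * x)
    sigma = -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)
    return omega, sigma


def x_of_slope(beta, field):
    """Fraction of minus spins at which d(sigma)/d(omega) equals beta."""
    return 1.0 / (1.0 + math.exp(2.0 * field * beta))


def small_tolerance_solution(alpha, field):
    """Exponents of eqs. (71)-(73) for Gamma = 0 and negligible tolerance."""
    omega1, _ = ism_point(x_of_slope(1.0, field), field)
    omega2, sigma2 = ism_point(x_of_slope(2.0, field), field)
    ell2 = 2.0 * omega2 - sigma2
    if alpha < omega2:                      # Phase I
        phase, omega_hat, lam, q = "I", alpha, LOG2, -ell2
    elif alpha < omega1:                    # Phase II
        x = 0.5 * (1.0 - (math.log(2.0 * math.cosh(field)) - alpha) / field)
        phase, omega_hat, lam = "II", alpha, LOG2
        q = ism_point(x, field)[1] - 2.0 * alpha
    else:                                   # Phase III: solve sigma(omega_hat) = alpha
        if alpha >= LOG2:
            raise ValueError("alpha must stay below log 2 in this sketch")
        lo, hi = x_of_slope(1.0, field), 0.5
        for _ in range(200):                # bisection on the increasing branch of sigma(x)
            mid = 0.5 * (lo + hi)
            if ism_point(mid, field)[1] < alpha:
                lo = mid
            else:
                hi = mid
        omega_hat = ism_point(0.5 * (lo + hi), field)[0]
        phase, lam, q = "III", LOG2 + omega_hat - alpha, alpha - 2.0 * omega_hat
    d = LOG2 - 2.0 * lam                    # eq. (69)
    return {"phase": phase, "omega_hat": omega_hat, "lambda": lam, "q": q,
            "d": d, "q_conj": alpha - d, "r_conj": alpha + q - 2.0 * d}  # eq. (70)


if __name__ == "__main__":
    for alpha in (0.30, 0.50, 0.65):
        print(alpha, small_tolerance_solution(alpha, 0.5))
```

In phase III the sketch uses $\hat\sigma = \alpha$, so the returned `r_conj` equals $2\lambda$ as in eq. (73).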


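Large-E case. If $E = e^{N\epsilon}$ is larger than $D = e^{Nd}$, i.e. if $\epsilon > d$, the EOSs of the parameters $q'$ and $r'$ are
modified as

$$r' = \alpha + r - 2\epsilon, \quad q' = \alpha - \epsilon. \tag{74}$$

These large-tolerance EOSs become valid beyond the critical values $\epsilon(\alpha)$:

$$\epsilon(\alpha) = -\log 2 \ \text{(Phase I, II)}, \quad \epsilon(\alpha) = -\log 2 + 2\hat\sigma(\alpha) - 2\hat\omega(\alpha) \ \text{(Phase III)}. \tag{75}$$

To distinguish the solutions above from the ones of the small-E case we denote phase I with large tolerance
by I_LT, as explained in sec. 3.2.3. Phases II_LT and III_LT are defined in the same way. The solutions of the
large-tolerance case become

Phase I_LT ($\hat\omega < \omega_2$):

$$\lambda = \log 2, \quad \hat\omega = \alpha - \epsilon - \log 2, \quad q' = \alpha - \epsilon = \hat\omega + \log 2, \quad q = r = -\ell_2,$$
$$r' = \alpha - 2\epsilon - \ell_2 = 2\log 2 + 2\hat\omega - \ell_2 - \alpha, \quad d = -\log 2. \tag{76}$$

Phase II_LT ($\omega_2 < \hat\omega < \omega_1$):

$$\lambda = \log 2, \quad \hat\omega = \alpha - \epsilon - \log 2, \quad q' = \alpha - \epsilon = \hat\omega + \log 2, \quad q = r = \hat\sigma - 2\hat\omega,$$
$$r' = \alpha + \hat\sigma - 2\hat\omega - 2\epsilon = \hat\sigma + 2\log 2 - \alpha, \quad d = -\log 2. \tag{77}$$

Phase III_LT ($\omega_1 < \hat\omega$):

$$\lambda = \log 2 + \hat\omega - \hat\sigma, \quad \hat\sigma - 2\hat\omega = \log 2 + \epsilon - \alpha, \quad q' = \alpha - \epsilon, \quad q = r = \hat\sigma - 2\hat\omega,$$
$$r' = \alpha + \hat\sigma - 2\hat\omega - 2\epsilon = 2\log 2 - \alpha - \hat\sigma + 2\hat\omega, \quad d = -\log 2 + 2\hat\sigma - 2\hat\omega. \tag{78}$$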

It is instructive to examine the validity of the condition $\epsilon > d$ in phase III_LT. To do so we need to know
the rate of increase of $\hat\sigma - \hat\omega$ with $\epsilon$ at fixed $\alpha$. Differentiating the equation determining the learning edge
in eqs. (78), we get

$$\frac{\partial(\hat\sigma - \hat\omega)}{\partial\epsilon} = 1 + \frac{\partial\hat\omega}{\partial\epsilon}. \tag{79}$$

This equation can be written as

$$-1 < \frac{\partial\hat\omega}{\partial\epsilon} = \frac{1}{\partial\hat\sigma/\partial\hat\omega - 2} < -\frac12. \tag{80}$$

The inequalities come from the condition $\partial\hat\sigma/\partial\hat\omega < 1$, which holds in phase III. This implies that $d$
increases more slowly than $\epsilon$ itself in III_LT, since $\partial d/\partial\epsilon = 2(1 + \partial\hat\omega/\partial\epsilon) < 1$. Thus the necessary condition
$\epsilon > d$ is satisfied in phase III_LT, as it should be. This condition will be useful for the study of the stability
of the RS ansatz.

4.3 Large-$\Gamma$ case

4.3.1 Saddle-point approximation

We use the saddle-point approximation in the integral defining $X_s$ (47), as in sec. 4.2.2. We call $f_s(p)$ the
exponent in the integral, and write down the derivatives

$$f_s(p) = -\frac12 Q'(p - \hat p_s)^2 - (\Lambda - z\sqrt{R'})(p - \hat p_s) - \Gamma p \log p, \tag{81}$$
$$f'_s(p) = -Q'(p - \hat p_s) - (\Lambda - z\sqrt{R'}) - \Gamma(\log p + 1), \tag{82}$$
$$f''_s(p) = -Q' - \frac{\Gamma}{p}. \tag{83}$$

The main difference with the $\Gamma = 0$ case is the presence of the term $p \log p$, singular at $p = 0$. As a
result the functional behavior around $p = 0$ changes completely, and there is always a peak in $[0, \infty]$.
This implies that the saddle-point approximation can be applied, irrespective of the target probability
value. The saddle-point equation $f'_s(p^*) = 0$ can be written as

$$p^* = \hat p_s - \frac{\Lambda - z\sqrt{R'} + \Gamma(1 + \log p^*)}{Q'}. \tag{84}$$

Let us make the following assumption, correct when $\Gamma$ is large enough,

$$\frac{\Lambda}{\Gamma} = O(N). \tag{85}$$

This relation, in turn, defines the meaning of 'large' $\Gamma$. This assumption, combined with eq. (49), gives
us the value of the learning edge from eq. (84): $\hat\omega = q' - \lambda = q' - \gamma$. For configurations with large target
probabilities the dominant term is $\hat p_s$ in eq. (84), and an iterative substitution yields

$$p^*_s \approx \hat p_s - \frac{\Lambda - z\sqrt{R'} + \Gamma(1 + \log \hat p_s)}{Q'}, \quad (s \in L). \tag{86}$$


Expanding $f_s(p)$ up to the second order, we may evaluate $X_s$.

For configurations corresponding to small target probabilities, the saddle point is very different. We
a priori postulate the expression of the solution

$$p^*_s = \bar p + \Delta_s, \quad (s \in S), \tag{87}$$

where the dominant term $\bar p$ is

$$\bar p = e^{-1 - (\Lambda - Q'\hat p_s - z\sqrt{R'})/\Gamma}, \quad (\forall s \in S). \tag{88}$$

Sec. A gives a justification of this form. $\bar p$ is required to take this value to satisfy the normalization condition.
According to the saddle-point calculation above we have

$$1 = \Big\langle \sum_s p_s \Big\rangle \approx \sum_{s\in S} \bar p + \sum_{s\in L} \hat p_s. \tag{89}$$


For small learning edge values, $\hat\omega < \omega_1$, configurations with small probabilities dominate, implying that
$\sum_{s\in S} \bar p \approx 2^N \bar p \approx 1 \Rightarrow \bar p = 2^{-N}$. Thus, $\bar p$ is automatically determined by the normalization condition.
Moreover, the correction term $\Delta_s$ in eq. (87) is determined by expanding eq. (84) with respect to $\Delta_s$ to
the first order and solving the resulting equation. We find

$$\Delta_s = -\frac{Q' \bar p^2}{Q' \bar p + \Gamma}. \tag{90}$$

This expansion is valid if $\bar p > |\Delta_s|$, yielding a trivial inequality $1 > Q'\bar p/(Q'\bar p + \Gamma)$.
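The first-order saddle point of eqs. (87)-(90) can be compared with a numerical root of the saddle-point equation (84). The sketch below is an illustration only: the parameter values are of order one and chosen by hand (an assumption, not the exponential scalings of the text), and the function names are hypothetical. The root of $f'_s$ is bracketed by bisection, which is safe since $f'_s$ is decreasing in $p$.

```python
import math


def fprime(p, lam, gamma_bias, q_conj, r_conj, p_hat, z):
    """f'_s(p) of eq. (82)."""
    return (-q_conj * (p - p_hat) - (lam - z * math.sqrt(r_conj))
            - gamma_bias * (math.log(p) + 1.0))


def saddle_exact(lam, gamma_bias, q_conj, r_conj, p_hat, z):
    """Root of f'_s on (0, 1), found by bisection."""
    lo, hi = 1e-12, 1.0
    args = (lam, gamma_bias, q_conj, r_conj, p_hat, z)
    assert fprime(lo, *args) > 0.0 > fprime(hi, *args), "root not bracketed"
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if fprime(mid, *args) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def saddle_first_order(lam, gamma_bias, q_conj, r_conj, p_hat, z):
    """Dominant term p_bar of eq. (88) and correction Delta_s of eq. (90)."""
    p_bar = math.exp(-1.0 - (lam - q_conj * p_hat - z * math.sqrt(r_conj)) / gamma_bias)
    delta = -q_conj * p_bar ** 2 / (q_conj * p_bar + gamma_bias)
    return p_bar, delta


if __name__ == "__main__":
    params = dict(lam=8.0, gamma_bias=1.0, q_conj=50.0, r_conj=4.0, p_hat=1e-3, z=0.5)
    p_bar, delta = saddle_first_order(**params)
    print(saddle_exact(**params), p_bar, p_bar + delta)
```

With these values $Q'\bar p/\Gamma$ is small, which is the regime where the first-order expansion is expected to hold.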
4.3.2 Equations of state and validity of the saddle-point approximation


It is convenient to compute the derivatives with respect to the order parameters of the expression in eq.
(46). The result is

$$Q = \sum_s \int Dz\, \big\langle (p - \hat p_s)^2 \big\rangle_{X_s}, \tag{91}$$
$$D = Q - R = \sum_s \int Dz\, \Big\{ \big\langle (p - \hat p_s)^2 \big\rangle_{X_s} - \big\langle (p - \hat p_s) \big\rangle^2_{X_s} \Big\}, \tag{92}$$
$$0 = \sum_s \int Dz\, \big\langle (p - \hat p_s) \big\rangle_{X_s}, \tag{93}$$

where we define

$$\langle (\cdots) \rangle_{X_s} = \frac{1}{X_s} \int_0^\infty dp\, (\cdots)\, e^{f_s(p)}. \tag{94}$$

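The EOSs for $R'$ and $Q'$ are identical to eq. (63), and are thus omitted above. The average value of $p$ is
replaced with the saddle-point value $p^*$. We need to quantify the fluctuations around the saddle point to
estimate the terms in the equations above, that is,

$$\big\langle (p - \langle p \rangle_{X_s})^2 \big\rangle_{X_s} \approx \frac{e^{f_s(p^*_s)} \int dx\, x^2\, e^{-\frac12 |f''_s(p^*_s)| x^2}}{e^{f_s(p^*_s)} \int dx\, e^{-\frac12 |f''_s(p^*_s)| x^2}} = \frac{1}{|f''_s(p^*_s)|} \approx \begin{cases} 1/Q' & (s \in L) \\ \bar p/\Gamma & (s \in S) \end{cases}. \tag{95}$$

Note that, for $s \in S$, we can write $p^* - \hat p_s \approx \bar p - \hat p_s$. These considerations lead us to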


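$$Q = \sum_{s\in S} \int Dz\, (\bar p - \hat p_s)^2 + \sum_{s\in L} \int Dz \Big( \frac{\Lambda - z\sqrt{R'} + \Gamma(1 + \log\hat p_s)}{Q'} \Big)^2$$
$$= 2^N \bar p^2 - 2\bar p \sum_{s\in S} \hat p_s + \sum_{s\in S} \hat p_s^2 + e^{N\hat\sigma} \Big( \frac{\Lambda + \Gamma}{Q'} \Big)^2 \Big[ \Big( 1 - \frac{N\hat\omega\Gamma}{\Lambda + \Gamma} \Big)^2 + \frac{R'}{(\Lambda + \Gamma)^2} \Big], \tag{96}$$

$$D = \sum_{s\in S} \int Dz\, \frac{\bar p}{\Gamma} + \sum_{s\in L} \int Dz\, \frac{1}{Q'} = 2^N \frac{\bar p}{\Gamma} + \frac{e^{N\hat\sigma}}{Q'}, \tag{97}$$

$$0 = \sum_{s\in S} \int Dz\, (\bar p - \hat p_s) - \sum_{s\in L} \int Dz\, \frac{\Lambda - z\sqrt{R'} + \Gamma(1 + \log\hat p_s)}{Q'}
= 2^N \bar p - \sum_{s\in S} \hat p_s - e^{N\hat\sigma}\, \frac{\Lambda + \Gamma}{Q'} \Big( 1 - \frac{N\hat\omega\Gamma}{\Lambda + \Gamma} \Big). \tag{98}$$

The dominant terms turn out to be

$$Q = \sum_{s\in S} \hat p_s^2, \tag{99}$$
$$D = 2^N \frac{\bar p}{\Gamma}, \tag{100}$$
$$2^N \bar p = \sum_{s\in S} \hat p_s, \tag{101}$$

with the corresponding exponents,

$$q = r = \begin{cases} -\ell_2 & (\hat\omega < \omega_2) \\ \hat\sigma - 2\hat\omega & (\omega_2 < \hat\omega) \end{cases}, \quad
d = \begin{cases} -\gamma & (\hat\omega < \omega_1) \\ \hat\sigma - \hat\omega - \gamma & (\omega_1 < \hat\omega) \end{cases}, \tag{102}$$

$$\lambda = \gamma + \frac{\log N}{N} + \frac{1}{N} \begin{cases} \log\log 2 & (\hat\omega < \omega_1) \\ \log(\log 2 + \hat\omega - \hat\sigma) & (\omega_1 < \hat\omega) \end{cases}. \tag{103}$$

Small-E case. Again we can neglect $E$ ($\ll D$). The EOSs for $q'$ and $r'$ and the equation for the learning
edge are fully identical to eq. (70) and eq. (53). Each phase is then characterized as follows: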

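Phase I_ME ($\hat\omega < \omega_2$):

$$\lambda = \gamma + \frac{\log(N \log 2)}{N}, \quad \hat\omega = \alpha, \quad q' = \alpha + \gamma, \quad q = r = -\ell_2, \quad r' = \alpha + 2\gamma - \ell_2, \quad d = -\gamma. \tag{104}$$

Phase II_ME ($\omega_2 < \hat\omega < \omega_1$):

$$\lambda = \gamma + \frac{\log(N \log 2)}{N}, \quad \hat\omega = \alpha, \quad q' = \alpha + \gamma, \quad q = r = \sigma(\alpha) - 2\alpha,$$
$$r' = \sigma(\alpha) - \alpha + 2\gamma, \quad d = -\gamma. \tag{105}$$

Phase III_ME ($\omega_1 < \hat\omega$):

$$\lambda = \gamma + \frac{\log\big(N(\log 2 - \hat\sigma + \hat\omega)\big)}{N}, \quad \hat\omega = \sigma^{-1}(\alpha), \quad q' = \hat\omega + \gamma, \quad q = r = \hat\sigma - 2\hat\omega,$$
$$r' = 2\gamma, \quad d = \hat\sigma - \hat\omega - \gamma. \tag{106}$$

It is easy to check that the assumption eq. (85) is satisfied by these solutions. Note that the learning edge
does not depend on $\gamma$.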

We now examine the validity of the saddle-point approximation for $X_s$, to determine when the
solutions above are correct. To do so we compare the width of the Gaussian with the location of its peak,
and check that the peak height diverges. For example, in phase I_ME,

Peak location:

$$p^*_s \approx \begin{cases} \hat p_s = e^{-N\omega_s}, & (s \in L) \\ \bar p = e^{-N\log 2}, & (s \in S) \end{cases} \tag{107}$$

Height:

$$f_s(p^*_s) \approx \begin{cases} e^{N(\gamma + \alpha - 2\omega_s)}, & (s \in L) \\ e^{N(\gamma - \log 2)}, & (s \in S) \end{cases} \tag{108}$$

Width:

$$|f''_s(p^*_s)|^{-1/2} = \sqrt{\frac{p^*_s}{Q' p^*_s + \Gamma}} \approx \begin{cases} (Q')^{-1/2} = e^{-N(\alpha + \gamma)/2}, & (s \in L) \\ \sqrt{\bar p/\Gamma} = e^{-N(\gamma + \log 2)/2}, & (s \in S) \end{cases} \tag{109}$$

According to these relations the saddle-point approximation is correct if $\gamma > \log 2 = \gamma_c$. The same
approach can be applied to II_ME and III_ME, with the resulting critical values: $\gamma_c = \log 2$ in II_ME and
$\gamma_c = \log 2 + \hat\omega - \hat\sigma$ for III_ME. These critical values are consistent with eq. (103), as $\Lambda$ is a continuous
function of $\Gamma$.

This continuous change also implies that, in the $0 < \Gamma < \Gamma_c = e^{N\gamma_c}$ region, all solutions coincide
with the ones found for $\Gamma = 0$. Consider for instance the peak value of the marginal measure over $p_s$
for $s \in S$. Assuming the order parameters are given by their expressions in the $\Gamma = 0$ case, we see
that the condition eq. TO is satisfied, which implies that eq. is valid. The saddle point for s £ S, 
$\bar p = e^{-1 - (\Lambda - Q'\hat p_s - z\sqrt{R'})/\Gamma}$, decays in a double-exponential manner with respect to $N$, as $\Lambda$ is exponentially
larger than $\Gamma$. Similarly, the peak height $f_s(\bar p)$ rapidly goes to zero for $s \in S$. The peak location, $\bar p$,
decreases faster than the width $\sqrt{\bar p/\Gamma}$. As a consequence the functional form $e^{f_s(p)}$ rapidly converges
to a pure exponential distribution, as in the $\Gamma = 0$ case.


Large-E case. The EOSs for $q'$ and $r'$ coincide with eq. (74). The critical values of $\epsilon$ are determined by
$E = D$:

$$\epsilon(\alpha, \gamma) = -\gamma \ \text{(Phase I, II)}, \quad \epsilon(\alpha, \gamma) = \hat\sigma(\alpha) - \hat\omega(\alpha) - \gamma \ \text{(Phase III)}. \tag{110}$$

New solutions can appear for $\epsilon > \epsilon(\alpha, \gamma)$, but should be discarded as they violate the small-tolerance condition;
see the discussion in sec. 3.3.

4.4 Stability of the replica symmetry


We study the de Almeida-Thouless stability of the replica-symmetric (RS) Ansatz, see eq. (40), against
fluctuations in the replica space [5]. Detailed calculations are reported in the Appendix. The
outcome of the calculation is the following stability condition against replicon fluctuations:

$$\frac{M}{(E + Q - R)^2} \Big[ \sum_s \int Dz \Big( \frac{\partial^2 \log X_s}{\partial\Lambda^2} \Big)^2 \Big] < 1. \tag{111}$$

The term in the brackets is asymptotically equivalent to $e^{N\hat\sigma}/Q'^2$. This can be shown in the small-$\Gamma$
case ($\Gamma < \Gamma_c$) using eqs. (60) and (61), and in the large-$\Gamma$ case ($\Gamma > \Gamma_c$) with the relation
$\partial^2 \log X_s/\partial\Lambda^2 = \langle (p - \hat p_s)^2 \rangle_{X_s} - \langle (p - \hat p_s) \rangle^2_{X_s}$ and eq. (95). From eq. (63), we obtain a transparent interpretation of the
stability condition

$$\frac{M}{(E + Q - R)^2}\, \frac{e^{N\hat\sigma}}{Q'^2} = \frac{e^{N\hat\sigma}}{M} < 1. \tag{112}$$

Therefore the RS solution becomes unstable if the number of large-probability configurations exceeds
the number of constraints. Inserting the solutions for $\hat\sigma$ in the small-$E$ regime, see eqs. (71)-(73) and eqs.
(104)-(106), we find that, irrespective of the value of $\Gamma$, the RS ansatz is stable in phases I and II but is
only marginally stable in phase III. Marginal stability means here that $\hat\sigma$ is equal to $\alpha$ in phase III. Our
calculation, limited to the leading order in $N$, cannot decide whether phase III is actually stable.

In the large-tolerance regime, the RS solution is stable across all phases, even in phase III_LT. A simple
calculation based on eq. (78) yields $\hat\sigma - \alpha = d - \epsilon < 0$, where the last inequality is proved at the end of
sec. 4.2.3.

To summarize, the RS ansatz is stable in all phases, but only marginally in phases III and III_ME. This
result may be related to the 'simple' structure of the version space. Consider the case of zero tolerance,
$E = 0$, and two distribution vectors, $\mathbf{p}_1$ and $\mathbf{p}_2$, in the version space. Any linear combination of these
two vectors, $\mathbf{p}_t = t\,\mathbf{p}_1 + (1 - t)\,\mathbf{p}_2$ with $t \in [0, 1]$, is a normalized distribution and lies in the version space.
Hence the version space is convex and connected. The instability of the RS ansatz, which is usually related
to the appearance of many disconnected and far apart components, may therefore not take place.
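The convexity argument can be spelled out in one line per condition. The block below is a sketch written with the notation of this section, assuming that at $E = 0$ membership in the version space means $\mathbf{v}^\mu \cdot (\mathbf{p} - \hat{\mathbf{p}}) = 0$ for all $\mu$, together with normalization and positivity.

```latex
\begin{align}
\mathbf{v}^\mu \cdot (\mathbf{p}_t - \hat{\mathbf{p}})
  &= t\,\mathbf{v}^\mu \cdot (\mathbf{p}_1 - \hat{\mathbf{p}})
   + (1-t)\,\mathbf{v}^\mu \cdot (\mathbf{p}_2 - \hat{\mathbf{p}}) = 0
   \quad \forall \mu, \\
\sum_s p_{t,s} &= t \sum_s p_{1,s} + (1-t) \sum_s p_{2,s} = t + (1-t) = 1, \\
p_{t,s} &= t\,p_{1,s} + (1-t)\,p_{2,s} \ge 0 \quad \forall s,
  \qquad t \in [0,1].
\end{align}
```

All three conditions are linear or sign-preserving under convex combinations, which is why the segment between any two admissible distributions stays inside the version space.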


5 Numerical simulations 

We now present a numerical procedure to sample the space of distributions with the measure $\rho$. Due to
the exponential growth of the version space with $N$, this procedure is applied in practice to small sizes,
$N \le 10$. However, the results confirm the analytical calculations reported above, and provide insights
about finite-size effects. We restrict to the case of zero tolerance ($E = 0$) throughout this section.


5.1 Sampling algorithm 


We resort to a Monte Carlo (MC) sampling method, in which the distribution $\mathbf{p}$ is updated at discrete time
steps. Each step corresponds to a random change of the current probability vector from $\mathbf{p}$ to $\mathbf{p}' = \mathbf{p} + \Delta\mathbf{p}$.
The move $\Delta\mathbf{p}$ must satisfy the following conditions:

Orthogonality to observable vectors: $\Delta\mathbf{p} \cdot \mathbf{v}^\mu = 0, \ \forall\mu$.

Normalization: $\sum_s \Delta p_s = \Delta\mathbf{p} \cdot \mathbf{1} = 0$, where $\mathbf{1} = (1, 1, \cdots, 1)$.

Positivity: $p_s + \Delta p_s \ge 0, \ \forall s$.

The orthogonality condition ensures that the constraints remain satisfied at all steps, provided they
are fulfilled by the initial value of the distribution. The same statement applies to the normalization
condition; we will specify below how the initial condition is chosen. The orthogonality and normalization
conditions restrict the possible directions $\mathbf{w}$ (normalized to unity) of the move. Once this direction $\mathbf{w}$ is
chosen we determine the range $[x_{\min}; x_{\max}]$ of the allowed amplitudes $x$ of the move $\Delta\mathbf{p} = x\,\mathbf{w}$ to fulfill
the positivity constraint:

$$x_{\min} = \max_s \min\Big( \frac{-p_s}{w_s}, \frac{1 - p_s}{w_s} \Big), \quad x_{\max} = \min_s \max\Big( \frac{-p_s}{w_s}, \frac{1 - p_s}{w_s} \Big). \tag{113}$$

Any intermediate value between these two bounds may be chosen uniformly at random. Next, we calculate
the entropy difference $\Delta S = S(\mathbf{p}') - S(\mathbf{p})$ and accept the move $\mathbf{p} \to \mathbf{p}'$ according to the Metropolis
rule, i.e. with probability

$$P_{\rm accept} = \min\big( 1, e^{\Gamma \Delta S} \big), \tag{114}$$

and reject it (leave $\mathbf{p}$ unchanged) with probability $1 - P_{\rm accept}$. A picture illustrating one Monte Carlo step
is shown in Fig. 8.
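One Monte Carlo step, as described by eqs. (113) and (114), can be sketched as follows. This is a minimal illustration and not the code used for the simulations: the function names are hypothetical, the direction `w` is assumed to be supplied by the caller and to already satisfy the orthogonality and normalization conditions, and tiny negative round-off values are clipped to zero.

```python
import math
import random


def entropy(p):
    """Shannon entropy S(p) = -sum_s p_s log p_s, with 0 log 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def allowed_range(p, w):
    """Amplitude range [x_min, x_max] keeping 0 <= p_s + x w_s <= 1, eq. (113)."""
    x_min, x_max = -math.inf, math.inf
    for ps, ws in zip(p, w):
        if ws == 0.0:
            continue                      # this component does not move
        a, b = -ps / ws, (1.0 - ps) / ws
        x_min = max(x_min, min(a, b))
        x_max = min(x_max, max(a, b))
    return x_min, x_max


def metropolis_step(p, w, gamma_bias, rng):
    """Propose p' = p + x w with x uniform in the allowed range; accept with eq. (114)."""
    x_min, x_max = allowed_range(p, w)
    x = rng.uniform(x_min, x_max)
    proposal = [max(0.0, ps + x * ws) for ps, ws in zip(p, w)]  # clip round-off
    delta_s = entropy(proposal) - entropy(p)
    if delta_s >= 0.0 or rng.random() < math.exp(gamma_bias * delta_s):
        return proposal, True
    return p, False


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.25, 0.25, 0.25, 0.25]
    w = [0.5, -0.5, 0.5, -0.5]            # unit norm, orthogonal to (1, 1, 1, 1)
    for _ in range(5):
        p, accepted = metropolis_step(p, w, gamma_bias=5.0, rng=rng)
        print(accepted, p)
```

Because the proposal is uniform on a segment whose endpoints do not depend on the current position along it, the proposal is symmetric and the Metropolis rule targets the entropy-biased measure on that segment.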



Fig. 8 Schematic picture of the version space and of one Monte Carlo step. The version space is convex, and includes
the target distribution $\hat{\mathbf{p}}$. From a distribution $\mathbf{p}$, a direction $\mathbf{w}$ inside the solution space is randomly chosen. We choose a
point on the segment along this direction (dashed line) uniformly at random, and $\mathbf{p}$ is updated to the chosen point $\mathbf{p}'$ with
probability $P_{\rm accept}$, see eq. (114).


To implement the algorithm, it is convenient not to assume that the observables are Gaussianly
distributed, as in the analytical treatment. Instead we consider the Fourier modes $\mathbf{w}^k$ on the $N$-dimensional
hypercube, where $k = (k_1, k_2, \ldots, k_N)$, with $k_i = 0, 1$ for each $i$, denotes the wave-number configurations.
The $2^N$ components of those Fourier modes are given by

$$w^k_s = \frac{1}{\sqrt{2^N}} \prod_{i=1}^{N} (s_i)^{k_i}. \tag{115}$$

The Fourier modes $\{\mathbf{w}^k\}_k$ form a complete orthonormal basis of the $2^N$-dimensional vector space. Note
that $\mathbf{w}^0 = \mathbf{1}/\sqrt{2^N}$, where $0$ is the all-zero wave-number configuration.

In the analytical calculations of sec. 4 we chose the observables $\mathbf{v}$ to be random vectors, with Gaussian
components. Any two randomly chosen vectors have a very small scalar product with high probability in
the large-$N$ limit. In our numerical simulation we rather choose the $M$ observables uniformly at random
over the discrete set of $2^N - 1$ Fourier modes $\mathbf{w}^k$, with $k \ne 0$. This choice is convenient since the Fourier
modes are orthogonal by construction, implying that the orthogonality condition mentioned above is
automatically satisfied as soon as we choose the direction of the move $\mathbf{w}$ as one of the Fourier modes
outside the set of observables. The normalization condition is easily satisfied as long as we exclude $\mathbf{w}^0$
from the set of possible directions for the move. We are thus left with a set of $2^N - M - 1$ Fourier modes,
each of which is a possible direction for the Monte Carlo move. Clearly our Monte Carlo Markov Chain
satisfies detailed balance and is irreducible.
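The Fourier modes of eq. (115) and the split between observables and move directions can be sketched as follows. The function names are hypothetical and the ordering of the configurations is an arbitrary choice of this illustration, with spins taking the values $\pm 1$.

```python
import itertools
import math
import random


def configurations(n):
    """All 2^n spin configurations s = (s_1, ..., s_n), with s_i = +1 or -1."""
    return list(itertools.product((1, -1), repeat=n))


def fourier_mode(k, n):
    """Components w^k_s = 2^(-n/2) prod_i s_i^(k_i), eq. (115)."""
    norm = 2.0 ** (-n / 2.0)
    return [norm * math.prod(s_i ** k_i for s_i, k_i in zip(s, k))
            for s in configurations(n)]


def split_modes(n, m, rng):
    """Draw m observables among the non-zero modes; the others are move directions."""
    nonzero = [k for k in itertools.product((0, 1), repeat=n) if any(k)]
    observables = rng.sample(nonzero, m)
    chosen = set(observables)
    moves = [k for k in nonzero if k not in chosen]
    return observables, moves


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    n, m = 3, 2
    observables, moves = split_modes(n, m, random.Random(1))
    print(len(observables), len(moves))    # m and 2^n - m - 1 modes
    w = fourier_mode(moves[0], n)
    print([round(dot(w, fourier_mode(k, n)), 12) for k in observables])
```

Any move direction drawn from `moves` is orthogonal to every observable and to the constant mode, so a move along it preserves both the constraints and the normalization.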


5.2 Results 

Using the above sampling procedure we calculate several quantities of interest including order parameters,
histograms of $p_s$, and spin-spin correlations. The target distribution we consider is again the ISM with
$H = 0.5$.







5.2.1 Check of equilibration 


We have run simulations for ten different samples ($N_{\rm sample} = 10$) for sizes $N = 4$, 6 and 8, and for two
samples ($N_{\rm sample} = 2$) for size $N = 10$ to compute the values of the order parameters. The error bars are
estimated through the standard deviation $\sigma_{\rm sample}$ across the samples:

$$\text{Error bar} = \frac{\sigma_{\rm sample}}{\sqrt{N_{\rm sample} - 1}}. \tag{116}$$
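Eq. (116) translates into a few lines of code. The sketch below assumes that $\sigma_{\rm sample}$ is the standard deviation normalized by $N_{\rm sample}$ (the text does not spell out the normalization); with that convention eq. (116) coincides with the usual standard error of the mean. The function name is hypothetical.

```python
import math


def error_bar(values):
    """Error bar of eq. (116): sigma_sample / sqrt(N_sample - 1).

    Assumption: sigma_sample is the standard deviation normalized by N_sample.
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two samples are needed")
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sigma / math.sqrt(n - 1)


if __name__ == "__main__":
    print(error_bar([1.0, 2.0, 3.0, 4.0]))
```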


A Monte Carlo step is defined as one attempted move, irrespective of its acceptance. Fig. 9 shows the
plot of the order parameter $q$ against the number of Monte Carlo steps, indicating how the system
approaches equilibrium. From the figure we see that thermalization becomes drastically faster for larger
$\Gamma$. For example, simulations with $M = 32$ constraints and $N = 10$ spins require at least $N_{\rm MC} \approx 2 \times 10^8$
steps for thermalization when $\Gamma = 0$, while $N_{\rm MC} = 2 \times 10^7$ steps seem to be sufficient when $\Gamma \ge 2^{10}$. This
trend is easy to understand, as an increase in the entropic bias concentrates the measure more and more
around the ME distribution, hence shrinking the space to be sampled.

Fig. 9 Plots of $q$ vs. the number of Monte Carlo (MC) steps for $N = 10$ spins and for different values of the number $M$ of
constraints. The left panel corresponds to $\Gamma = 0$ (no entropic bias), and the right panel to $\Gamma = 2^N = 1024$.

The actual number of Monte Carlo steps used in our simulation varied depending on the quantities
we wanted to estimate. To compute the values of the order parameters we chose $N_{\rm MC} = 1 \times 10^7$ for
$N = 4$ and 6, $N_{\rm MC} = 3 \times 10^7$ for $N = 8$, and $N_{\rm MC} = 3 \times 10^8$ for $N = 10$. One third of those steps are
discarded in the computation of the averages. To obtain accurate histograms of the distributions of the
order parameters, however, we typically chose 5 times more MC steps.

5.2.2 Order parameters, marginal measures, and spin correlations 

To confirm our analytical predictions and to estimate finite-size effects we compute the order parameters
for various values of $M$, $\Gamma$ and $N$, and report the results in Fig. 10. Those figures clearly show that the
analytical predictions (black curves) are consistent with the numerical results for sizes as small as $N = 10$.
The phase transitions are well captured. In particular, the phase transition to the ME phase is clearly
seen in the behaviour of $d$ (lower right panel). This agreement shows that our analytical findings, derived
in the infinite-$N$ limit, are numerically accurate even for small sizes.

We have also computed the histograms of the single-configuration probabilities $p_s$, with the results
shown in Figs. 11 and 12. Due to the symmetry of the ISM the target probabilities $\hat p_s$ depend only on
the number of, say, $-1$ spins in the configuration $s$, which we call $N^-$. We checked that this symmetry
is approximately recovered in the histograms despite the noise introduced by the Monte Carlo sampling.
Therefore we show only one among the $p_s$ with the same $N^-$ in those figures.

Fig. 11 shows the histograms corresponding to different $N^-$, for $N = 8$ and $\Gamma = 0$, and two values of
the number $M$ of constraints. Two characteristic shapes of the histograms emerge. One looks roughly
Gaussian, while the other typical behaviour resembles an exponential distribution. These shapes are in
very good agreement with the two possible functional dependences of $\rho_s(p_s)$ upon $p_s$ given in Fig. 3. A
larger number of Gaussian-like histograms is found for $M = 80$ than for $M = 30$, implying that more
target probabilities are learned in the former case than in the latter, as expected from the theory. The
size and sample dependence of the histograms of $p_{\mathbf{1}}$ (corresponding to the configuration $\mathbf{1}$ with $N^- = 0$
minus spins) are examined in Fig. 12. The corresponding target probability $\hat p_{\mathbf{1}}$ is the largest one of the
ISM and is accurately learned for the values of $M$ corresponding to the figures. From the left panel of Fig.
12 we see that our analytical prediction regarding $\rho_s(p_s)$, namely that it is centered at $\hat p_s - 1/M$, becomes more and
more accurate as the system size increases. An excellent agreement with the prediction is reached for
$N = 10$. Sample-to-sample fluctuations of the histogram of $p_{\mathbf{1}}$ are shown in the right panel. The choice
of the constraints produces moderate fluctuations in the location of the peak of the Gaussian-like
histograms.

Fig. 10 Order parameters $q$ (left) and $d$ (right) of the ISM with $H = 0.5$, computed from Monte Carlo simulations and
plotted vs. $\alpha$ (upper row) and $\gamma$ (lower row). Analytical predictions are shown with the black curves. The phase transition
from small to large $\Gamma$ is best seen in the lower right panel. Finite-size effects seem to be stronger for larger $\alpha$.

Fig. 11 Histograms of single-configuration probabilities $p_s$ for $N = 8$ and $\Gamma = 0$; the numbers $N^-$ of negative spins in the
configurations are indicated on the figure. The left panel corresponds to $M = 30$, and the right panel to $M = 80$.

Fig. 12 (Left) System-size dependence of histograms of $p_{\mathbf{1}}$ (corresponding to the configuration $\mathbf{1}$ with all plus spins), for fixed
$\alpha \approx 0.598$ and $\Gamma = 0$. For each histogram, the rightmost dashed line shows the location of the target value $\hat p_{\mathbf{1}}$, while the
leftmost dashed line shows $\hat p_{\mathbf{1}} - 1/M$. (Right) Sample-to-sample fluctuations of the histogram of $p_{\mathbf{1}}$ for $N = 8$, $M = 80$ and
$\Gamma = 0$. The dashed vertical line represents the target value.

Last of all we show in Fig. 13 the histograms of multi-spin correlations, for specific subsets of spins.
We consider in particular

$$ c_{12} = \sum_s p_s\, s_1 s_2, \qquad c_{123} = \sum_s p_s\, s_1 s_2 s_3. \tag{117} $$
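The definitions in eq. (117) can be illustrated with a minimal sketch. It assumes, purely for illustration, that the ISM target is a product distribution with $p_s \propto \exp(H \sum_i s_i)$ and H = 0.5; the function names `independent_spin_distribution` and `correlation` are hypothetical and not taken from the simulations reported here. Under that assumption the exact correlations are $\tanh(H)^2$ and $\tanh(H)^3$.

```python
import itertools
import math

def independent_spin_distribution(n_spins, field):
    """Assumed target: p_s proportional to exp(field * sum_i s_i) over all 2^N configurations."""
    configs = list(itertools.product((1, -1), repeat=n_spins))
    weights = [math.exp(field * sum(s)) for s in configs]
    z = sum(weights)
    return configs, [w / z for w in weights]

def correlation(configs, probs, sites):
    """c_{ij...} = sum_s p_s * s_i * s_j * ... for the listed (0-based) sites, as in eq. (117)."""
    total = 0.0
    for s, p in zip(configs, probs):
        prod = 1
        for i in sites:
            prod *= s[i]
        total += p * prod
    return total

configs, probs = independent_spin_distribution(n_spins=8, field=0.5)
c12 = correlation(configs, probs, sites=(0, 1))       # pairwise correlation
c123 = correlation(configs, probs, sites=(0, 1, 2))   # triplet correlation
print(c12, c123)
```

Replacing `probs` by a distribution sampled from the version space would give one entry of the histograms discussed in the text.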


We see that the predicted values of c are in fair agreement with the target distribution values, but both
the thermal and the sample-to-sample fluctuations are rather large, compared to the results of Fig. 12.

Fig. 13 Histograms of pairwise ($c_{12}$, left) and triplet ($c_{123}$, right) correlations for a system of N = 8 spins. Vertical lines
indicate the target distribution values. Each histogram corresponds to one sample of M = 80 observables.


6 Conclusion 

In this paper we have investigated the properties of the space of probability distributions (over a large
set of discrete variable configurations), constrained to reproduce many average values of observables,
computed according to a target distribution $\hat p$. For zero tolerance E = 0, the space of admissible
distributions $p$ defines the version space. The version space contains many distributions of interest, in addition
to the target distribution $\hat p$, such as the center-of-mass distribution, $\langle p \rangle$, which is the flat average over
all admissible distributions, and the maximum entropy distribution, $p_{ME}$. In the case of finite tolerance
E > 0 the version space acquires a probabilistic meaning. We introduce a probability measure $\rho$ over
the space of distributions, giving more weight to the distributions $p$ in good agreement with the target
values of the observables. The measure $\rho$ may furthermore be biased to favour distributions $p$ with large
entropies $S(p)$. To do so we introduce a multiplicative factor in the measure $\rho$, growing exponentially
with $\Gamma S(p)$. The coefficient $\Gamma$ acts as an inverse temperature: $\Gamma = 0$ recovers the unbiased
measure, while, for $\Gamma \to \infty$, the measure becomes fully concentrated around $p_{ME}$. Varying the value
of $\Gamma$ allows us to understand how effective the maximum entropy principle is at approximating the target
distribution $\hat p$.
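The interpolating role of the entropic bias can be sketched on a toy example. The snippet below is only an illustration under strong assumptions: the version space is replaced by a hypothetical finite list `candidates` of three distributions, the constraint part of the measure is ignored, and the helper names `entropy` and `biased_weights` are ours. It shows that a weight proportional to $\exp(\Gamma S(p))$ is uniform at $\Gamma = 0$ and concentrates on the largest-entropy candidate at large $\Gamma$.

```python
import math

def entropy(p):
    """Shannon entropy S(p) = -sum_s p_s log p_s (natural logarithm)."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def biased_weights(candidates, gamma):
    """Normalized weights proportional to exp(gamma * S(p)) over a finite candidate list."""
    scores = [gamma * entropy(p) for p in candidates]
    shift = max(scores)                           # subtract the maximum for numerical stability
    raw = [math.exp(sc - shift) for sc in scores]
    z = sum(raw)
    return [r / z for r in raw]

# Hypothetical stand-ins for three members of a version space over four configurations.
candidates = [
    [0.25, 0.25, 0.25, 0.25],   # largest entropy among the three
    [0.40, 0.30, 0.20, 0.10],
    [0.70, 0.10, 0.10, 0.10],
]
flat = biased_weights(candidates, gamma=0.0)      # unbiased case
sharp = biased_weights(candidates, gamma=500.0)   # strongly biased case
print(flat, sharp)
```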

To compute analytically the volume of the version space and its main properties we have assumed that 
observables were quenched random variables, and we have ignored correlations between different values of 
the observable in different variable configurations. This assumption is not realistic, as physical observables 
are generally rather smooth functions of the configurations. As a result of this simplifying assumption, we 
have been able to compute the typical distances between the target, the centre-of-mass, and the maximum 
entropy distributions, as well as the typical fluctuations around the centre of mass. The calculation was 
done within the replica symmetric framework, and we have checked that our solution was locally stable 
against replica-symmetry-breaking fluctuations (replicon modes). A major outcome of our calculation is 
the notion of learning edge, which separates learned from as-yet-unknown configuration probabilities.
Learned probabilities correspond to large target probabilities, while the probabilities of configurations
associated with small target probabilities remain unknown. As the number of observables increases the
learning edge moves to smaller and smaller probability values, implying that learning proceeds. The 
decrease of the learning edge is not generally accompanied by a decrease of the distance between the 
centre-of-mass and the target distribution, unless it hits the probabilities contributing most to that 
distance at specific critical points. 

Numerical simulations were performed to confirm our asymptotic and analytical predictions, and to 
quantify finite-size effects. Due to the exponential growth of the dimension of the version space with the
number N of variables we are limited to very small sizes, in practice $N \le 10$. Nevertheless, the results are
in good agreement with our analytical predictions, and finite-size effects do not seem to alter our asymptotic
results in a significant way. In particular our predictions for the existence of a learning edge and for 
the functional forms of the marginal measures over the probabilities of configurations are remarkably 
confirmed by the numerics. 

We have also studied the role of the entropic bias for a given number of measured observables. Our 
calculation shows that the maximum entropy distribution is not closer to the target distribution than 
any other randomly chosen distribution in the version space. This negative result is due to our choice of 
fully uncorrelated observables. It is quite likely that introducing an entropic bias becomes efficient and 
improves learning performances if we require that both the observables and the target distribution satisfy 
some smoothness criteria. Some preliminary results, obtained without assuming that the observable values 
are uncorrelated from configuration to configuration, are reported in [9] and seem to support this guess. 
Unfortunately the analytical calculations with ‘correlated’ observables are very involved, and we have not
been able to make substantial progress so far. The theoretical importance and the practical relevance
of the issues addressed here, such as how the number of constraints should depend on the smoothness 
of the observables and of the target distribution, or whether the phase transitions at distinct steps of 
the learning process found in the random uncorrelated case studied here exist in the correlated case, 
are strong incentives to extend this study to realistic observable ensembles. Another valuable direction 
for further research would be to search for ‘good’ sets of observables. While we have restricted ourselves here to
the case of M independently drawn observables, it would be interesting to optimize the choice of those 
observables to make the volume of the version space as small as possible, and to get as close as possible 
to the target distribution. In this regard, we have extended the present analysis to the case of a finite
number of replicas, $n \neq 0$, the result of which will be reported soon.

A Lesson from the no-constraint case for finite $\Gamma$

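If $\Gamma$ is finite in the no-constraint case, the volume is written as

$$ \int_{-i\infty}^{i\infty} d\Lambda\, e^{\Lambda} \prod_s \int_0^{\infty} dp_s\, e^{-\Lambda p_s - \Gamma p_s \log p_s}. \tag{118} $$

Now, the saddle point with respect to $p_s$ is quite simple:

$$ p_s = e^{-1-\Lambda/\Gamma}. \tag{119} $$

This is the origin of the solution assumed in sec. 4.3.1. Assuming the saddle-point approximation is correct, we can write
the integration as

$$ \int dp_s\, e^{-\Lambda p_s - \Gamma p_s \log p_s} \approx \sqrt{\frac{2\pi}{\Gamma}\, e^{-1-\Lambda/\Gamma}}\;\, e^{\Gamma e^{-1-\Lambda/\Gamma}}, \tag{120} $$

and the log volume is

$$ F = \Lambda + 2^N \left\{ \Gamma e^{-1-\Lambda/\Gamma} + \frac{1}{2}\left( \log(2\pi) - 1 - \frac{\Lambda}{\Gamma} - \log\Gamma \right) \right\}. \tag{121} $$

Taking a variation with respect to $\Lambda$, we get

$$ \frac{1}{2^N} = e^{-1-\Lambda/\Gamma} + \frac{1}{2\Gamma}. \tag{122} $$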


We know $\Lambda = 2^N$ at $\Gamma = 0$. Thus, it is natural to think that the saddle point $p = e^{-1-\Lambda/\Gamma}$ takes a meaningful value only for
$\Gamma > 2^N$, and otherwise $p$ rapidly becomes zero and the result reduces to the case $\Gamma = 0$. Hence, eq. (122) is satisfied
at leading order as $2^{-N} = e^{-1-\Lambda/\Gamma}$. Hence, $\Lambda = (N \log 2 - 1)\Gamma$ and the term $1/(2\Gamma)$ becomes subleading. Substituting
this into eq. (121), we get

$$ F \approx (N \log 2)\,\Gamma - 2^{N-1} \left( N \log 2 + \log\Gamma - \log 2\pi \right). \tag{123} $$

The leading scale of the log volume is thus dominated by $\Gamma$.
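The saddle-point steps of this appendix can be checked numerically with a short sketch. It is not part of the original calculation: the helper names are hypothetical, the integral over a single $p_s$ is evaluated with a plain midpoint rule on a truncated interval, and the parameter values ($\Lambda = 0$, $\Gamma = 200$, then $N = 8$, $\Gamma = 10^6$) are arbitrary choices placed in the regime where the approximations are expected to hold.

```python
import math

def laplace_log_integral(lam, gamma):
    """Saddle-point estimate of log int_0^inf dp exp(-lam*p - gamma*p*log p)."""
    p_star = math.exp(-1.0 - lam / gamma)          # stationary point of the exponent
    return gamma * p_star + 0.5 * math.log(2.0 * math.pi * p_star / gamma)

def numeric_log_integral(lam, gamma, p_max=5.0, steps=200_000):
    """Midpoint-rule estimate of the same log-integral, truncated at p_max."""
    h = p_max / steps
    exponents = []
    for k in range(steps):
        p = (k + 0.5) * h
        exponents.append(-lam * p - gamma * p * math.log(p))
    shift = max(exponents)                         # factor out the largest term to avoid overflow
    total = sum(math.exp(x - shift) for x in exponents) * h
    return shift + math.log(total)

def exact_lambda(n_spins, gamma):
    """Solve 2^-N = exp(-1 - lam/gamma) + 1/(2*gamma) for lam (needs gamma > 2^(N-1))."""
    return -gamma * (1.0 + math.log(2.0 ** (-n_spins) - 0.5 / gamma))

def leading_lambda(n_spins, gamma):
    """Leading-order solution lam = (N log 2 - 1) * gamma."""
    return (n_spins * math.log(2.0) - 1.0) * gamma

log_saddle = laplace_log_integral(lam=0.0, gamma=200.0)
log_numeric = numeric_log_integral(lam=0.0, gamma=200.0)
lam_exact = exact_lambda(n_spins=8, gamma=1.0e6)
lam_lead = leading_lambda(n_spins=8, gamma=1.0e6)
print(log_saddle, log_numeric, lam_exact, lam_lead)
```

The two log-integrals should differ only by a correction of relative order $1/(\Gamma p)$, and the two values of $\Lambda$ should agree up to the subleading term $1/(2\Gamma)$.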


B Stability analysis 

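The replica generating function is given by

$$ \phi(n) = \sum_{a<b} Q_{ab} Q'_{ab} - \frac{M}{2} \mathrm{Tr} \log (E + Q) + \frac{M}{2}\, n \log E + \sum_s \log \Theta_s. \tag{124} $$

Here Tr denotes the trace of a matrix. We write the exponent in $\Theta_s$ in eq. (39) as

$$ f_s(\{p_s^a\}_a) = -\Lambda \sum_a p_s^a + \sum_{a<b} Q'_{ab}\, (p_s^a - \hat p_s)(p_s^b - \hat p_s) - \Gamma \sum_a p_s^a \log p_s^a. \tag{125} $$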

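Let us consider some small fluctuations around the RS saddle-point order parameters:

$$ Q_{ab} = Q^{RS}_{ab} + x_{ab}, \qquad Q'_{ab} = Q'^{RS}_{ab} + \hat x_{ab}. \tag{126} $$

Then,

$$ \sum_{a<b} Q_{ab} Q'_{ab} \approx \sum_{a<b} Q^{RS}_{ab} Q'^{RS}_{ab} + \sum_{a<b} x_{ab} \hat x_{ab}, \tag{127} $$

$$ \mathrm{Tr} \log (E + Q) \approx \mathrm{Tr} \log (E + Q^{RS}) - \frac{1}{2} \mathrm{Tr} (Ax)^2, \tag{128} $$

where we define $A = (E + Q^{RS})^{-1}$, and

$$ f_s = f_s^{RS} + \sum_{a<b} \hat x_{ab}\, \tilde p_s^a \tilde p_s^b = f_s^{RS} + \Delta f_s, \tag{129} $$

with $\tilde p_s^a = p_s^a - \hat p_s$. Thus

$$ \log \Theta_s \approx \log \mathrm{Tr}\, e^{f_s^{RS}} \left( 1 + \Delta f_s + \frac{1}{2} \Delta f_s^2 \right) \approx \log \Theta_s^{RS} + \frac{1}{2} \langle \Delta f_s^2 \rangle_s - \frac{1}{2} \langle \Delta f_s \rangle_s^2, \tag{130} $$

where we have introduced the notation

$$ \langle \cdot \rangle_s = \frac{1}{\Theta_s^{RS}} \mathrm{Tr}\, e^{f_s^{RS}} (\cdot). \tag{131} $$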


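Note that we have omitted all the first-order terms, which vanish at the saddle point. Thus we get $\phi(n) = \phi^{RS} + \Delta$ with

$$ 2\Delta = 2 \sum_{a<b} x_{ab} \hat x_{ab} + \frac{M}{2} \mathrm{Tr} (Ax)^2 + \sum_{a<b} \sum_{c<d} \hat x_{ab} \hat x_{cd} \sum_s \left( \langle \tilde p_s^a \tilde p_s^b \tilde p_s^c \tilde p_s^d \rangle_s - \langle \tilde p_s^a \tilde p_s^b \rangle_s \langle \tilde p_s^c \tilde p_s^d \rangle_s \right). \tag{132} $$

We write $A_{aa} = X$ and $A_{ab} = Y$ ($a \neq b$), and those components are easily calculated:

$$ X = \frac{E + Q + (n-2)R}{(E + Q + (n-1)R)(E + Q - R)}, \qquad Y = -\frac{R}{(E + Q + (n-1)R)(E + Q - R)}, \tag{133} $$

$$ X + (n-1)Y = \frac{1}{E + Q + (n-1)R}, \qquad X - Y = \frac{1}{E + Q - R}. \tag{134} $$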
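Eqs. (133) and (134) can be verified with a few lines of code. The sketch below assumes, as the formulas suggest, that the replica-symmetric matrix $E + Q^{RS}$ has diagonal entries $E + Q$ and off-diagonal entries $R$; the helper names and the numerical values of n, E, Q and R are arbitrary illustrations.

```python
def rs_matrix(n, diag, off):
    """n x n replica-symmetric matrix with constant diagonal and off-diagonal entries."""
    return [[diag if a == b else off for b in range(n)] for a in range(n)]

def matmul(left, right):
    """Plain matrix product of two square matrices given as nested lists."""
    size = len(left)
    return [[sum(left[i][k] * right[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def rs_inverse_entries(n, e, q, r):
    """Diagonal (X) and off-diagonal (Y) entries of the inverse, following eq. (133)."""
    denom = (e + q + (n - 1) * r) * (e + q - r)
    x = (e + q + (n - 2) * r) / denom
    y = -r / denom
    return x, y

n, e, q, r = 5, 0.3, 1.2, 0.4                      # arbitrary test values
x, y = rs_inverse_entries(n, e, q, r)
product = matmul(rs_matrix(n, e + q, r), rs_matrix(n, x, y))
print(product[0][0], product[0][1])                # close to 1 and 0 respectively
```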


Eq. (132) is a quadratic form with respect to $\{x_{ab}\}$ and $\{\hat x_{ab}\}$ and can be written as $2\Delta = V^T G V$, where V is a column
vector with components $\{x_{ab}\}$ and $\{\hat x_{ab}\}$, and $V^T$ is its transpose. We now want to determine whether the Hessian matrix
G has unstable modes. The most likely candidate lies in the replicon eigenspace, which is spanned by the vectors whose
components $(x_{ab}, \hat x_{ab})$ depend only on whether their replica indices are equal to or different from two fixed values $a = \theta$ and
$b = \omega$ [8]. In the replicon space, diagonal ($a = b$) fluctuations tend to be irrelevant since other (transverse, longitudinal)
modes span the n-dimensional space $(x_{aa}, \hat x_{aa})$. Hence we set hereafter $x_{aa} = \hat x_{aa} = 0$.

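We arrange the components of V by ordering x as

$$ V = (x_{12}, x_{13}, \cdots, x_{1n}, x_{23}, \cdots, x_{n-1,n}), \tag{135} $$

and $\hat x$ as well. The Hessian G has the following form

$$ G = \begin{pmatrix}
\begin{matrix} S & T \cdots T & U \cdots U \\ T & \ddots & \\ \vdots & & \ddots \end{matrix} & I \\[4pt]
I & \begin{matrix} \hat S & \hat T \cdots \hat T & \hat U \cdots \hat U \\ \hat T & \ddots & \\ \vdots & & \ddots \end{matrix}
\end{pmatrix}, \tag{136} $$

where I is the identity matrix, and

$$ \hat S = \sum_s \left( \langle (\tilde p_s^a)^2 (\tilde p_s^b)^2 \rangle_s - \langle \tilde p_s^a \tilde p_s^b \rangle_s^2 \right), \tag{137} $$

$$ \hat T = \sum_s \left( \langle (\tilde p_s^a)^2 \tilde p_s^b \tilde p_s^c \rangle_s - \langle \tilde p_s^a \tilde p_s^b \rangle_s \langle \tilde p_s^a \tilde p_s^c \rangle_s \right), \tag{138} $$

$$ \hat U = \sum_s \left( \langle \tilde p_s^a \tilde p_s^b \tilde p_s^c \tilde p_s^d \rangle_s - \langle \tilde p_s^a \tilde p_s^b \rangle_s \langle \tilde p_s^c \tilde p_s^d \rangle_s \right), \tag{139} $$

$$ S = M(X^2 + Y^2), \qquad T = MY(X + Y), \qquad U = 2MY^2. \tag{140} $$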


Let us find the eigenvectors of this matrix. The first eigenvector $V_1$ is obtained by assuming $x_{\mu\nu} = a$ and $\hat x_{\mu\nu} = \hat a$ for any
$\mu, \nu$. The upper half (in the matrix drawn in eq. (136)) of the eigenvalue equation $G V_1 = \lambda_1 V_1$ gives

$$ \left( S + 2(n-2)T + \frac{1}{2}(n-2)(n-3)U \right) a + \hat a = \lambda_1 a, \tag{141} $$

and the lower half yields

$$ a + \left( \hat S + 2(n-2)\hat T + \frac{1}{2}(n-2)(n-3)\hat U \right) \hat a = \lambda_1 \hat a. \tag{142} $$

These equations admit a solution with non-vanishing $a, \hat a$ only if the eigenvalue $\lambda_1$ satisfies the following relation:

$$ \lambda_1^2 - (C_1 + \hat C_1)\lambda_1 + (C_1 \hat C_1 - 1) = 0, \tag{143} $$

where $C_1 = S + 2(n-2)T + (n-2)(n-3)U/2$ and $\hat C_1 = \hat S + 2(n-2)\hat T + (n-2)(n-3)\hat U/2$. This equation says that this
mode spans a two-dimensional space.

The next type of solution $V_2$ is obtained by selecting a replica index, say, $\theta = 1$. This solution $V_2$ has $x_{\mu\nu} = b$ and
$\hat x_{\mu\nu} = \hat b$ when $\mu$ or $\nu$ is equal to 1, and $x_{\mu\nu} = c$ and $\hat x_{\mu\nu} = \hat c$ otherwise. The first row of the eigenvalue equation $G V_2 = \lambda_2 V_2$
gives

$$ (S + (n-2)T)\, b + \left( (n-2)T + \frac{1}{2}(n-2)(n-3)U \right) c + \hat b = \lambda_2 b, \tag{144} $$

and the first row of the lower half of matrix G in eq. (136) yields

$$ b + (\hat S + (n-2)\hat T)\, \hat b + \left( (n-2)\hat T + \frac{1}{2}(n-2)(n-3)\hat U \right) \hat c = \lambda_2 \hat b. \tag{145} $$

We now impose the orthogonality condition between $V_2$ and $V_1$. This leads to

$$ (n-1)\, b + \frac{1}{2}(n-1)(n-2)\, c = 0, \tag{146} $$

and the same relation holds for $\hat b$ and $\hat c$. Substituting these conditions, we get

$$ (S + (n-4)T - (n-3)U)\, b + \hat b = \lambda_2 b, \qquad (\hat S + (n-4)\hat T - (n-3)\hat U)\, \hat b + b = \lambda_2 \hat b, \tag{147} $$

which leads to

$$ \lambda_2^2 - (C_2 + \hat C_2)\lambda_2 + (C_2 \hat C_2 - 1) = 0, \tag{148} $$

where $C_2 = S + (n-4)T - (n-3)U$ and $\hat C_2 = \hat S + (n-4)\hat T - (n-3)\hat U$. As there are n possible choices for the replica
index $\theta$ and two eigenvalues/eigenvectors for each choice, the subspace spanned by $V_1$ and $V_2$ is of dimension 2n.

The third mode $V_3$ is obtained by treating two replicas $\theta, \omega$ as special ones. This solution $V_3$ has $x_{\theta\omega} = d$ and $\hat x_{\theta\omega} = \hat d$,
$x_{\theta a} = x_{\omega a} = e$ and $\hat x_{\theta a} = \hat x_{\omega a} = \hat e$, and $x_{ab} = f$ and $\hat x_{ab} = \hat f$ otherwise. The orthogonality condition with $V_2$ is

$$ b\,(d + (n-2)e) + c \left( (n-2)e + \frac{1}{2}(n-2)(n-3) f \right) = b\,(d + (n-4)e - (n-3)f) = 0, \tag{149} $$

where we use eq. (146). The one with $V_1$ is

$$ d + 2(n-2)e + \frac{1}{2}(n-2)(n-3) f = 0. \tag{150} $$

These relations mean

$$ e = -\frac{d}{n-2}, \qquad \frac{1}{2}(n-2)(n-3) f = -d - 2(n-2)e = d. \tag{151} $$

Similar relations hold for the hat variables. According to these relations, the first rows of the upper and lower halves of the
eigenvalue equation $G V_3 = \lambda_3 V_3$ yield

$$ dS + 2(n-2)eT + \frac{1}{2}(n-2)(n-3) f U + \hat d = (S - 2T + U)\, d + \hat d = \lambda_3 d, \tag{152} $$

$$ d + (\hat S - 2\hat T + \hat U)\, \hat d = \lambda_3 \hat d. \tag{153} $$

Thus we obtain

$$ \lambda_3^2 - (C_3 + \hat C_3)\lambda_3 + (C_3 \hat C_3 - 1) = 0, \tag{154} $$

where $C_3 = S - 2T + U$ and $\hat C_3 = \hat S - 2\hat T + \hat U$. This solution spans an $n(n-3)$-dimensional space, implying that no eigenmodes
are left.
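The three families of modes can be checked numerically on a single diagonal block of G. The sketch below is an illustration only: it builds the block acting on $\{x_{ab}\}$ with entry S when two replica pairs coincide, T when they share exactly one index and U when they share none, then tests the vectors described above (replicas are labelled from 0, so $\theta = 0$ and $\omega = 1$). The function names and the values of n, S, T, U are arbitrary assumptions. The coupling to the hatted variables through the identity blocks is not included; it only turns each eigenvalue $C_i$ into the roots of the quadratic equations (143), (148) and (154).

```python
import itertools

def pair_block(n, s, t, u):
    """Block indexed by replica pairs a<b: s if equal, t if one shared index, u if none."""
    pairs = list(itertools.combinations(range(n), 2))
    def entry(p, q):
        shared = len(set(p) & set(q))              # 0, 1 or 2 shared replica indices
        return (u, t, s)[shared]
    return pairs, [[entry(p, q) for q in pairs] for p in pairs]

def is_eigenvector(matrix, vec, eigenvalue, tol=1e-9):
    """True if matrix @ vec equals eigenvalue * vec within tol, row by row."""
    for row, v in zip(matrix, vec):
        if abs(sum(m * x for m, x in zip(row, vec)) - eigenvalue * v) > tol:
            return False
    return True

n, s, t, u = 6, 1.7, 0.4, -0.3                     # arbitrary test values
pairs, block = pair_block(n, s, t, u)

# Mode 1: all components equal.
v1 = [1.0 for _ in pairs]
c1 = s + 2 * (n - 2) * t + (n - 2) * (n - 3) * u / 2

# Mode 2: replica 0 singled out, orthogonal to mode 1 (c = -2b/(n-2)).
b = 1.0
c = -2.0 * b / (n - 2)
v2 = [b if 0 in p else c for p in pairs]
c2 = s + (n - 4) * t - (n - 3) * u

# Mode 3 (replicon): replicas 0 and 1 singled out, with e and f fixed by orthogonality.
d = 1.0
e = -d / (n - 2)
f = 2.0 * d / ((n - 2) * (n - 3))
v3 = [(f, e, d)[len(set(p) & {0, 1})] for p in pairs]
c3 = s - 2 * t + u

print(is_eigenvector(block, v1, c1), is_eigenvector(block, v2, c2), is_eigenvector(block, v3, c3))
```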

For the stability of the saddle point, all of the eigenvalues must be non-negative. This condition corresponds to

$$ \forall i, \quad C_i \hat C_i < 1. \tag{155} $$

This condition takes into account the fact that $Q'_{ab}$ is originally a pure imaginary variable, which means that $\delta Q_{ab}\, \delta Q'_{ab}$
is associated with a multiplicative factor $i$ and $\delta Q'_{ab}\, \delta Q'_{ab}$ acquires a factor $-1$. Hence, if we change variable from $Q'_{ab}$ to
$i Q'_{ab}$, the diagonal block in the lower half part of G gets a factor $-1$, and the off-diagonal part becomes $iI$, which leads to
the positivity condition in eq. (155).

The replicon mode corresponds to $\lambda_3$ and $V_3$. Thus, the stability condition is

$$ (S - 2T + U)(\hat S - 2\hat T + \hat U) < 1. \tag{156} $$

A straightforward calculation shows that

$$ S - 2T + U = M(X - Y)^2 = \frac{M}{(E + Q - R)^2}. \tag{157} $$

We now turn to the calculation of $\hat S - 2\hat T + \hat U$. Within the RS assumption the average $\langle O \rangle_s$, where $O = O^a O^b \cdots O^c$, can
be expressed as

{0) -T5^:! d ’ K{0 ‘ )x - " {0 ‘ )x - 


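(158)

where X s was defined in eq. rm . Thus,

$$ \hat S - 2\hat T + \hat U = \sum_s \left\{ \langle (\tilde p_s^a)^2 (\tilde p_s^b)^2 \rangle_s - 2 \langle (\tilde p_s^a)^2 \tilde p_s^b \tilde p_s^c \rangle_s + \langle \tilde p_s^a \tilde p_s^b \tilde p_s^c \tilde p_s^d \rangle_s \right\}. \tag{159} $$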


Noticing that 



(160) 


the stability condition may be written, in the $n \to 0$ limit, in the form shown in eq. (111).